Abstract

Perceptron is a classic online algorithm for learning a classification function. In this paper, we provide a novel extension of the perceptron algorithm to the learning to rank problem in information retrieval. We consider popular listwise performance measures such as Normalized Discounted Cumulative Gain (NDCG) and Average Precision (AP). We propose a novel family of listwise, large margin ranking surrogates, which are adaptable to NDCG and AP measures, and derive a perceptron-like algorithm using these surrogates. Exploiting a self-bounding property of the proposed surrogates, we provide a guarantee on the cumulative NDCG (or AP) induced loss incurred by our perceptron-like algorithm. We show that, if there exists a perfect oracle ranker which can correctly rank, with some margin, each instance in an online sequence, the cumulative NDCG (or AP) induced loss of the perceptron algorithm on that sequence is bounded by a constant, irrespective of the length of the sequence. This result is a learning to rank analogue of Novikoff's convergence theorem for the classification perceptron. However, our perceptron-like algorithm for learning to rank has two drawbacks. First, unlike the classification perceptron, the prediction at each round depends on a learning rate parameter. Second, the perceptron loss bound does not match our established lower bound on the cumulative loss achievable by any deterministic online algorithm. We propose a second perceptron-like algorithm which achieves the lower bound and is independent of the learning rate parameter. However, our second algorithm does not adapt to different ranking measures, does not possess the listwise property and does not perform well on real world datasets. Experiments on simulated datasets corroborate our theoretical results, and experiments on large industrial benchmark datasets demonstrate competitive performance.


1. Introduction 


Learning to rank (Liu, 2011) is a supervised learning problem where the output space consists of rankings of a set of objects. In the learning to rank problem that frequently arises in information retrieval, the objective is to rank documents associated with a query, in the order of the relevance of the documents for the given query. The accuracy of a ranked list, given actual relevance scores of the documents, is measured by various ranking performance measures, such as Normalized Discounted Cumulative Gain (NDCG) (Järvelin and Kekäläinen, 2002) and Average Precision (AP) (Baeza-Yates and Ribeiro-Neto, 1999). Since optimization of ranking measures during the training phase is computationally intractable, ranking methods are often based on minimizing surrogate losses that are easy to optimize.
The historical importance of the perceptron algorithm in the classification literature is immense (Rosenblatt, 1958; Freund and Schapire, 1999). Classically, the perceptron algorithm was not linked to surrogate minimization, but the modern perspective on perceptron is to interpret it as online gradient descent (OGD), during mistake rounds, on the hinge loss function (Shalev-Shwartz, 2011). The hinge loss has special properties that allow one to establish bounds on the cumulative zero-one loss (viz., the total number of mistakes) in classification, without making any statistical assumptions on the data generating mechanism. Novikoff's celebrated result (Novikoff, 1962) about the perceptron says that, if there is a perfect linear classification function which can correctly classify, with some margin, every instance in an online sequence, then the total number of mistakes made by perceptron, on that sequence, is bounded. Moreover, unlike the standard OGD algorithm, the performance of perceptron is independent of the learning rate parameter, which is a significant advantage because the optimal parameter value does not have to be tuned.

Our work provides a novel extension of the perceptron algorithm to the learning to rank setting with a focus on two listwise ranking measures, NDCG and AP. Listwise measures are so named because the quality of a ranking function is judged on an entire list of documents associated with a query, usually with an emphasis on avoiding errors near the top of the ranked list. Specifically, we make the following contributions in this work.


• We develop a family of listwise large margin ranking surrogates. The family consists of Lipschitz functions and is parameterized by a set of weight vectors that makes the surrogates adaptable to losses induced by performance measures NDCG and AP. The family of surrogates is an extension of the hinge surrogate in classification that upper bounds the 0-1 loss. The family of surrogates has a special self-bounding property: the norm of the gradient of a surrogate can be bounded by the surrogate loss itself.


• We exploit the self-bounding property of the surrogates to develop an online perceptron-like algorithm for learning to rank (Algorithm 2). We provide bounds on the cumulative NDCG and AP induced losses (Theorem 6). We prove that, if there is a perfect linear ranking function which can rank correctly, with some margin, every instance in an online sequence, our perceptron-like algorithm perfectly ranks all but a finite number of instances (Corollary 7). This implies that the cumulative loss induced by NDCG or AP is bounded by a constant, and our result can be seen as an extension of the classification perceptron mistake bound (Theorem 1). The performance of our perceptron algorithm, however, depends on a learning rate parameter, which is a disadvantage compared to the classification perceptron. Moreover, the bound depends linearly on the number of documents per query. In practice, during evaluation, NDCG is often cut off at a point which is much smaller than the number of documents per query. In that scenario, we prove that the cumulative NDCG loss of our perceptron is upper bounded by a constant which depends only on the cut-off point (Theorem 8).

• We prove a lower bound, on the cumulative loss induced by NDCG or AP, that can be achieved by any deterministic online algorithm (Theorem 9) under a separability assumption. The lower bound is independent of the number of documents per query. We propose a second perceptron-like algorithm (Algorithm 3) which achieves the lower bound (Theorem 10), with performance being independent of the learning rate parameter. However, the surrogate on which this perceptron-type algorithm operates is not listwise in nature and does not adapt to different performance measures. Thus, its empirical performance on real data is significantly worse than that of the first perceptron algorithm (Algorithm 2).
• We provide empirical results on simulated as well as large scale benchmark datasets and compare the performance of our perceptron algorithm with the online version of the widely used ListNet learning to rank algorithm (Cao et al., 2007).

The rest of the paper is organized as follows. Section 2 provides formal definitions and notations related to the problem setting. Section 3 provides a review of perceptron for classification, including algorithm and theoretical analysis. Section 4 introduces the family of listwise large margin ranking surrogates, and contrasts our surrogates with a number of existing large margin ranking surrogates in the literature. Section 5 introduces the perceptron algorithm for learning to rank, and discusses various aspects of the algorithm and the associated theoretical guarantee. Section 6 establishes a lower bound on NDCG/AP induced cumulative loss and introduces the second perceptron-like algorithm. Section 7 compares our work with existing perceptron algorithms for ranking. Section 8 provides empirical results on simulated and large scale benchmark datasets.

2. Problem Definition 

In learning to rank, we formally denote the input space as $\mathcal{X} \subseteq \mathbb{R}^{m \times d}$. Each input consists of $m$ rows of document-query features represented as $d$ dimensional vectors. Each input corresponds to a single query and, therefore, the $m$ rows have features extracted from the same query but $m$ different documents. In practice $m$ changes from one input instance to another but we treat $m$ as a constant for ease of presentation. For $X \in \mathcal{X}$, $X = (x_1, \ldots, x_m)^\top$, where $x_i \in \mathbb{R}^d$ is the feature vector extracted from a query and the $i$th document associated with that query. The supervision space is $\mathcal{Y} \subseteq \{0, 1, \ldots, n\}^m$, representing relevance score vectors. If $n = 1$, the relevance vector is binary graded. For $n > 1$, the relevance vector is multi-graded. Thus, for $R \in \mathcal{Y}$, $R = (R_1, \ldots, R_m)^\top$, where $R_i$ denotes the relevance of the $i$th document to a given query. Hence, $R$ represents a vector and $R_i$, a scalar, denotes the $i$th component of that vector. Also, the relevance vector generated at time $t$ is denoted $R_t$, with $i$th component denoted $R_{t,i}$.

The objective is to learn a ranking function which ranks the documents associated with a query in such a way that more relevant documents are placed ahead of less relevant ones. The prevalent technique is to learn a scoring function and obtain a ranking by sorting the score vector in descending order. For $X \in \mathcal{X}$, a linear scoring function is $f_w(X) = X \cdot w = s^w \in \mathbb{R}^m$, where $w \in \mathbb{R}^d$.

The quality of the learnt ranking function is evaluated on a test query using various performance 
measures. We use two of the most popular performance measures in our paper, viz. NDCG and AP. 

NDCG, cut off at $k \le m$ for a query with $m$ documents, with relevance vector $R$ and score vector $s$ induced by a ranking function, is defined as follows:

$$\mathrm{NDCG}_k(s, R) = \frac{1}{Z_k(R)} \sum_{i=1}^{k} G(R_{\pi_s(i)})\, D(i). \tag{1}$$

The shorthand for $\mathrm{NDCG}_k(s, R)$ is $\mathrm{NDCG}_k$. Here, $G(r) = 2^r - 1$, $D(i) = \frac{1}{\log_2(i + 1)}$, and $Z_k(R) = \max_{\pi \in S_m} \sum_{i=1}^{k} G(R_{\pi(i)})\, D(i)$. Further, $S_m$ represents the set of permutations over $m$ objects. $\pi_s = \mathrm{argsort}(s)$ is the permutation induced by sorting score vector $s$ in descending order (we use $\pi_s$ and $\mathrm{argsort}(s)$ interchangeably). A permutation $\pi$ gives a mapping from ranks to documents and $\pi^{-1}$ gives a mapping from documents to ranks. Thus, $\pi(i) = j$ means document $j$ is placed at position $i$ while $\pi^{-1}(i) = j$ means document $i$ is placed at position $j$. For $k = m$, we denote $\mathrm{NDCG}_m(s, R)$ as $\mathrm{NDCG}(s, R)$. The popular performance measure, Average Precision (AP), is defined only for binary relevance vectors, i.e., each component can only take values in $\{0, 1\}$:

$$\mathrm{AP}(s, R) = \frac{1}{r} \sum_{j:\, R_{\pi_s(j)} = 1} \frac{\sum_{i \le j} \mathbb{1}[R_{\pi_s(i)} = 1]}{j}, \tag{2}$$

where $r = \|R\|_1$ is the total number of relevant documents.

All ranking performance measures are actually gains. When we say “NDCG induced loss”, we mean a loss function that simply subtracts NDCG from its maximum possible value, which is 1 (same for AP).
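The two measures in Eq. (1) and Eq. (2) can be sketched in a few lines of Python. This is a minimal illustration, not code from the paper: the function names are hypothetical, ties in the scores are broken by document index, and a query with no relevant document is assigned a gain of 1 (both are assumptions of this sketch).

```python
import math

def argsort_desc(s):
    """Document indices sorted by descending score (ties broken by index; an assumption)."""
    return sorted(range(len(s)), key=lambda i: (-s[i], i))

def dcg_at_k(order, R, k):
    # G(r) = 2^r - 1 and D(i) = 1 / log2(i + 1) for 1-based position i.
    return sum((2 ** R[doc] - 1) / math.log2(pos + 2)
               for pos, doc in enumerate(order[:k]))

def ndcg_at_k(s, R, k=None):
    k = len(R) if k is None else k
    z = dcg_at_k(argsort_desc(R), R, k)  # Z_k(R): DCG of an ideal ordering
    if z == 0:
        return 1.0  # assumption: no relevant documents means nothing to lose
    return dcg_at_k(argsort_desc(s), R, k) / z

def average_precision(s, R):
    r = sum(R)  # number of relevant documents (binary relevance)
    if r == 0:
        return 1.0  # assumption, as above
    hits, total = 0, 0.0
    for pos, doc in enumerate(argsort_desc(s), start=1):
        if R[doc] == 1:
            hits += 1
            total += hits / pos  # precision at the position of a relevant document
    return total / r

# The induced losses are 1 - NDCG and 1 - AP.
print(1 - ndcg_at_k([0.1, 0.9, 0.5], [1, 0, 1]))
print(1 - average_precision([0.1, 0.9, 0.5], [1, 0, 1]))
```

A perfect ranking gives a gain of exactly 1, hence an induced loss of 0, under both measures.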


3. Perceptron for Classification 


We will first briefly review the perceptron algorithm for classification, highlighting the modern viewpoint that it executes online gradient descent (OGD) (Zinkevich, 2003) on hinge loss during mistake rounds and achieves a bound on the total number of mistakes. This will allow us to directly compare and contrast our extension of perceptron to the learning to rank setting. For more details, we refer the reader to the survey written by Shalev-Shwartz (2011, Section 3.3).

In classification, an instance is of the form $x \in \mathbb{R}^d$ and the corresponding supervision (label) is $y \in \{-1, 1\}$. A linear classifier is a scoring function $g_w(\cdot)$, parameterized by $w \in \mathbb{R}^d$, producing score $g_w(x) = x \cdot w = s \in \mathbb{R}$. Classification of $x$ is obtained by using the “sign” predictor on $s$, i.e., $\mathrm{sign}(s) \in \{-1, 1\}$. The loss is of the form: $\ell(w, (x, y)) = \mathbb{1}[\mathrm{sign}(x \cdot w) \ne y]$. The hinge loss is defined as: $\phi(w, (x, y)) = [1 - y(x \cdot w)]_+$, where $[a]_+ = \max\{0, a\}$.

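The perceptron algorithm operates on the loss $f_t(w)$, defined on a sequence of data $\{x_t, y_t\}_{t \ge 1}$, produced by an adaptive adversary as follows:

$$f_t(w) = \begin{cases} [1 - y_t(x_t \cdot w)]_+ & \text{if } \ell(w_t, (x_t, y_t)) = 1 \\ 0 & \text{if } \ell(w_t, (x_t, y_t)) = 0 \end{cases} \tag{3}$$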

where $w_t$ is the learner's move in round $t$. It is important to understand the concept of the loss $f_t(\cdot)$ and adaptive adversary here. An adaptive adversary is allowed to choose $f_t$ at round $t$ based on the moves of the perceptron algorithm (Algorithm 1) up to that round. Once the learner fixes its choice $w_t$ at the end of step $t - 1$, the adversary decides which function to play. It is either $[1 - y_t(x_t \cdot w)]_+$ or 0, depending on whether $\ell(w_t, (x_t, y_t))$ is 1 or 0 respectively. Notice that $f_t(w)$ is convex in both cases.

The perceptron updates a classifier $g_{w_t}(\cdot)$ (effectively updates $w_t$) in an online fashion. The update occurs by application of OGD on the sequence of functions $f_t(w)$ in the following way: perceptron initializes $w_1 = 0$ and uses update rule $w_{t+1} = w_t - \eta z_t$, where $z_t \in \partial f_t(w_t)$ ($z_t$ is a subgradient) and $\eta$ is the learning rate (the importance of $\eta$ will be discussed at the end of the section). If $\ell(w_t, (x_t, y_t)) = 0$, then $f_t(w_t) = 0$; hence $z_t = 0$. Otherwise, $z_t = -y_t x_t \in \partial f_t(w_t)$. Thus,

$$w_{t+1} = \begin{cases} w_t & \text{if } \ell(w_t, (x_t, y_t)) = 0 \\ w_t + \eta\, y_t x_t & \text{if } \ell(w_t, (x_t, y_t)) = 1. \end{cases} \tag{4}$$


The perceptron algorithm for classification is described below: 
Algorithm 1 Perceptron Algorithm for Classification

Learning rate $\eta > 0$, $w_1 = 0 \in \mathbb{R}^d$.
For $t = 1$ to $T$
    Receive $x_t$.
    Predict $p_t = \mathrm{sign}(x_t \cdot w_t)$.
    Receive $y_t$.
    If $\ell(w_t, (x_t, y_t)) \ne 0$
        $w_{t+1} = w_t + \eta\, y_t x_t$
    else
        $w_{t+1} = w_t$
End For


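Theorem 1. Suppose that the perceptron algorithm for classification runs on an online sequence of data $\{(x_1, y_1), \ldots, (x_T, y_T)\}$ and let $R_x = \max_t \|x_t\|_2$. Let $f_t(\cdot)$ be defined as in Eq. (3). For all $u \in \mathbb{R}^d$, and with the learning rate $\eta$ set to the value that optimizes the bound, the perceptron mistake bound is:

$$\sum_{t=1}^{T} \ell(w_t, (x_t, y_t)) \le \sum_{t=1}^{T} f_t(u) + R_x \|u\|_2 \sqrt{\sum_{t=1}^{T} f_t(u)} + R_x^2 \|u\|_2^2. \tag{5}$$

In the special case where there exists $u$ s.t. $f_t(u) = 0$, $\forall t$, we have

$$\forall\, T, \quad \sum_{t=1}^{T} \ell(w_t, (x_t, y_t)) \le R_x^2 \|u\|_2^2. \tag{6}$$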

As can be clearly seen from Eq. (5), the cumulative loss (i.e., the total number of mistakes over $T$ rounds) is upper bounded in terms of the cumulative sum of the functions $f_t(\cdot)$. In the special case where there exists a perfect linear classifier with margin, Eq. (6) shows that the total number of mistakes is bounded, regardless of the number of instances.

One drawback of the bound in Eq. (6) is that the concept of margin is not explicit, i.e., it is hidden in the norm of the parameter of the perfect classifier ($\|u\|_2$). Let us assume that there is a linear classifier parameterized by a unit norm vector $u^*$, such that all instances $x_t$ are not only correctly classified, but correctly classified with a margin $\gamma$, defined as:

$$y_t (x_t \cdot u^*) \ge \gamma, \quad \forall\, t. \tag{7}$$

It is easy to see that the scaled vector $u = u^* / \gamma$, whose norm is $1 / \gamma$, will satisfy $f_t(u) = 0$ for all $t$. Therefore, we have the following corollary.

Corollary 2. If the margin condition (7) holds, then the total number of mistakes is upper bounded by $\frac{R_x^2}{\gamma^2}$, a bound independent of the number of instances in the online sequence.

Importance of learning rate parameter $\eta$: The prediction at round $t$ is $p_t = \mathrm{sign}(x_t \cdot w_t)$. Let $\mathcal{M}_t$ indicate the rounds, up to time point $t - 1$, where perceptron made a mistake. Starting from $w_1 = 0$, unraveling $w_t$, we get $p_t = \mathrm{sign}\left(\sum_{i \in \mathcal{M}_t} \eta\, x_t \cdot (y_i x_i)\right)$. It can be easily seen that $p_t$ is invariant to the value of $\eta$, for $\eta > 0$. Hence, the actual performance of the perceptron algorithm (in terms of total number of mistakes) is independent of the learning rate $\eta$ and thus, $\eta = 1$ can be fixed from the beginning of the algorithm. The reason for including $\eta$ in the algorithm is that in the analysis (Theorem 1), the perceptron loss bound uses standard regret analysis of OGD, where the optimal regret bound is established by optimizing over the learning rate $\eta$. So, though the performance is actually independent of $\eta$, optimization over $\eta$ is necessary to establish the optimal theoretical upper bound on the loss.
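The review above can be illustrated with a small Python sketch of the classification perceptron. It is a minimal illustration under stated assumptions: the function names and the toy data are hypothetical, and the convention sign(0) = +1 is a choice made for this sketch. The toy data is separable by the unit vector (1, 0) with margin 0.8, so the number of mistakes can be compared against the bound of Corollary 2, and running with different learning rates illustrates the invariance discussed in the last paragraph.

```python
def sign(value):
    # Convention assumed for this sketch: sign(0) = +1.
    return 1 if value >= 0 else -1

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def perceptron(data, eta=1.0):
    """Run the perceptron update over (x, y) pairs; return final weights and mistake count."""
    w = [0.0] * len(data[0][0])
    mistakes = 0
    for x, y in data:
        if sign(dot(x, w)) != y:                 # mistake round
            w = [wi + eta * y * xi for wi, xi in zip(w, x)]
            mistakes += 1
    return w, mistakes

# Hypothetical toy data: the label is the sign of the first coordinate.
toy = [([1.0, 0.5], 1), ([-1.0, 0.2], -1), ([0.8, -0.3], 1), ([-0.9, -0.4], -1)] * 5

_, n_mistakes = perceptron(toy, eta=1.0)
r_x_sq = max(dot(x, x) for x, _ in toy)          # R_x^2
gamma = min(y * x[0] for x, y in toy)            # margin w.r.t. the unit vector (1, 0)
print(n_mistakes, r_x_sq / gamma ** 2)           # mistakes vs. the bound R_x^2 / gamma^2
```

Changing `eta` rescales `w` but leaves every predicted sign, and hence the mistake count, unchanged.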

4. A Novel Family of Listwise Surrogates 

We define the novel SLAM family of loss functions: these are Surrogate, Large margin, Listwise and Lipschitz losses, Adaptable to multiple performance measures, and can handle Multiple graded relevance. For score vector $s \in \mathbb{R}^m$ and relevance vector $R \in \mathcal{Y}$, the family of convex loss functions is defined as:

$$\phi^{v}_{\mathrm{SLAM}}(s, R) = \min_{\delta \in \mathbb{R}^m} \sum_{i=1}^{m} v_i \delta_i \quad \text{s.t. } \delta_i \ge 0,\ \forall\, i; \quad s_i + \delta_i \ge \Delta + s_j \ \text{ if } R_i > R_j,\ \forall\, i, j. \tag{8}$$

The constant $\Delta$ denotes margin and $v = (v_1, \ldots, v_m)$ is an element-wise non-negative weight vector. Different vectors $v$, to be defined later, yield different members of the SLAM family. Though $\Delta$ can be varied for empirical purposes, we fix $\Delta = 1$ for our analysis. The intuition behind the loss setting is that scores associated with more relevant documents should be higher, with a margin, than scores associated with less relevant documents. The weights decide how much weight to put on the errors.

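The following reformulation of $\phi^{v}_{\mathrm{SLAM}}(s, R)$ will be useful in later derivations. With $\Delta = 1$, the minimization in Eq. (8) decouples across documents, and the optimal slack is $\delta_i = \max\big(0, \max_j \{\mathbb{1}(R_i > R_j)(1 + s_j - s_i)\}\big)$, so that

$$\phi^{v}_{\mathrm{SLAM}}(s, R) = \sum_{i=1}^{m} v_i \max\Big(0, \max_{j = 1, \ldots, m} \{\mathbb{1}(R_i > R_j)(1 + s_j - s_i)\}\Big). \tag{9}$$

Lemma 3. For any relevance vector $R$, the function $\phi^{v}_{\mathrm{SLAM}}(\cdot, R)$ is convex.

Proof. The claim is obvious from the representation given in Eq. (9): the loss is a non-negative weighted sum of pointwise maxima of affine functions of $s$. □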
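As a minimal sketch of how a SLAM loss can be evaluated, the Python function below uses the fact that the minimization over the slack variables decouples across documents: the optimal slack for document i is the largest margin violation against any strictly less relevant document, floored at zero. The function name `slam_loss` and the example inputs are hypothetical.

```python
def slam_loss(s, R, v, margin=1.0):
    """Weighted sum of optimal slacks: sum_i v_i * max(0, max_{j: R_i > R_j} (margin + s_j - s_i))."""
    m = len(s)
    total = 0.0
    for i in range(m):
        slack = 0.0
        for j in range(m):
            if R[i] > R[j]:                      # only pairs where i is strictly more relevant
                slack = max(slack, margin + s[j] - s[i])
        total += v[i] * slack
    return total

# Scores that respect the relevance order with margin 1 incur zero loss.
print(slam_loss([2.0, 0.0], [1, 0], [1.0, 0.0]))
# A violated pair incurs a positive loss.
print(slam_loss([0.0, 1.0], [1, 0], [1.0, 0.0]))
```

Each term is a maximum of affine functions of the scores, which is the structure that makes the loss convex.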

4.1 Weight Vectors Parameterizing the SLAM Family 

As we stated after Eq. (8), different weight vectors lead to different members of the SLAM family. The weight vectors play a crucial role in the subsequent theoretical analysis. We will provide two weight vectors, $v^{\mathrm{AP}}$ and $v^{\mathrm{NDCG}}$, that result in upper bounds for AP and NDCG induced losses respectively. Later, we will discuss the necessity of choosing such weight vectors.

Since the losses in the SLAM family are calculated with the knowledge of the relevance vector $R$, for ease of subsequent derivations, we can assume, without loss of generality, that documents are sorted according to their relevance levels. Thus, we assume that $R_1 \ge R_2 \ge \ldots \ge R_m$, where $R_i$ is the relevance of document $i$. Note that both $v^{\mathrm{AP}}$ and $v^{\mathrm{NDCG}}$ depend on the relevance vector $R$ but we hide that dependence in the notation to reduce clutter.
Weight vector for AP loss: Let $R \in \mathbb{R}^m$ be a binary relevance vector. Let $r$ be the number of relevant documents (thus, $R_1 = R_2 = \ldots = R_r = 1$ and $R_{r+1} = \ldots = R_m = 0$). We define vector $v^{\mathrm{AP}} \in \mathbb{R}^m$ as

$$v_i^{\mathrm{AP}} = \begin{cases} \frac{1}{r} & \text{if } i = 1, 2, \ldots, r \\ 0 & \text{if } i = r + 1, \ldots, m. \end{cases} \tag{10}$$

Weight vector for NDCG loss: For a given relevance vector $R \in \mathbb{R}^m$, we define vector $v^{\mathrm{NDCG}} \in \mathbb{R}^m$ as

$$v_i^{\mathrm{NDCG}} = \frac{G(R_i)\, D(i)}{Z(R)}, \quad i = 1, \ldots, m. \tag{11}$$


Note: Both weights ensure that $v_1 \ge v_2 \ge \ldots \ge v_m$ (since $R_1 \ge R_2 \ge \ldots \ge R_m$). Using the weight vectors, we have the following upper bounds.

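Theorem 4. Let $v^{\mathrm{AP}} \in \mathbb{R}^m$ and $v^{\mathrm{NDCG}} \in \mathbb{R}^m$ be the weight vectors as defined in Eq. (10) and Eq. (11) respectively. Let $\mathrm{AP}(s, R)$ and $\mathrm{NDCG}(s, R)$ be the AP value and NDCG value determined by relevance vector $R \in \mathbb{R}^m$ and score vector $s \in \mathbb{R}^m$. Then, the following inequalities hold, $\forall\, s$:

$$\phi^{v^{\mathrm{AP}}}_{\mathrm{SLAM}}(s, R) \ge 1 - \mathrm{AP}(s, R), \qquad \phi^{v^{\mathrm{NDCG}}}_{\mathrm{SLAM}}(s, R) \ge 1 - \mathrm{NDCG}(s, R). \tag{12}$$

The proof of the theorem is in Appendix A.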
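The upper bounds of Theorem 4 can be spot-checked numerically. The sketch below is an illustration, not a substitute for the proof: it draws random scores, builds the two weight vectors for relevance vectors sorted in descending order (as assumed in the text), and checks both inequalities. All function names are hypothetical, and ties in scores are broken by document index.

```python
import math
import random

def ranking(s):
    """Document indices in descending score order (ties broken by index; an assumption)."""
    return sorted(range(len(s)), key=lambda i: (-s[i], i))

def ndcg(s, R):
    # R is assumed sorted in descending relevance, so Z(R) is the DCG of the identity order.
    dcg = sum((2 ** R[doc] - 1) / math.log2(pos + 2) for pos, doc in enumerate(ranking(s)))
    z = sum((2 ** r - 1) / math.log2(i + 2) for i, r in enumerate(R))
    return dcg / z

def ap(s, R):
    hits, total = 0, 0.0
    for pos, doc in enumerate(ranking(s), start=1):
        if R[doc] == 1:
            hits += 1
            total += hits / pos
    return total / sum(R)

def v_ap(R):
    r = sum(R)
    return [1.0 / r if x == 1 else 0.0 for x in R]

def v_ndcg(R):
    gains = [(2 ** r - 1) / math.log2(i + 2) for i, r in enumerate(R)]
    z = sum(gains)
    return [g / z for g in gains]

def slam(s, R, v):
    m = len(s)
    return sum(v[i] * max([0.0] + [1.0 + s[j] - s[i] for j in range(m) if R[i] > R[j]])
               for i in range(m))

def check_bounds(trials=1000, m=6, seed=0):
    rng = random.Random(seed)
    for _ in range(trials):
        s = [rng.uniform(-2, 2) for _ in range(m)]
        R_bin = sorted((rng.randint(0, 1) for _ in range(m)), reverse=True)
        R_multi = sorted((rng.randint(0, 3) for _ in range(m)), reverse=True)
        if sum(R_bin) > 0 and slam(s, R_bin, v_ap(R_bin)) < 1 - ap(s, R_bin) - 1e-12:
            return False
        if sum(R_multi) > 0 and slam(s, R_multi, v_ndcg(R_multi)) < 1 - ndcg(s, R_multi) - 1e-12:
            return False
    return True

print(check_bounds())
```

Queries with no relevant document are skipped, since both weight vectors are undefined in that case.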


4.2 Properties of SLAM Family and Upper Bounds 

We discuss some of the properties of the SLAM family and related upper bounds.

Listwise Nature of SLAM Family: The critical property for a surrogate to be considered listwise is that the loss must be calculated over the entire list of documents as a whole, with errors at the top penalized more than errors at the bottom. Since a perfect ranking places the most relevant documents at the top, errors corresponding to the most relevant documents should be penalized more in SLAM in order for it to be considered a listwise family. Both $v^{\mathrm{AP}}$ and $v^{\mathrm{NDCG}}$ have the property that the more relevant documents get more weight.

Upper Bounds on NDCG and AP: By Theorem 4, the weight vectors make losses in the SLAM family upper bounds on NDCG and AP induced losses. The SLAM loss family is analogous to the hinge loss in classification. Similar to hinge loss, the surrogate losses of the SLAM family are 0 when the predicted scores respect the relevance labels (with some margin). The upper bound property will be crucial in deriving guarantees for a perceptron-like algorithm in learning to rank. Like hinge loss, the upper bounds can possibly be loose in some cases, but, as we show next, the upper bounding weights make the SLAM family Lipschitz continuous with a small Lipschitz constant. This naturally restricts SLAM losses from growing too quickly. Empirically, we will show that the perceptron developed based on the SLAM family produces competitive performance on large scale industrial datasets. Along with the theory, the empirical performance supports the fact that the upper bounds are quite meaningful.

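Lipschitz Continuity of SLAM: Lipschitz continuity of an arbitrary loss $\ell(s, R)$ w.r.t. $s$ in $\ell_2$ norm means that there is a constant $L_2$ such that $|\ell(s_1, R) - \ell(s_2, R)| \le L_2 \|s_1 - s_2\|_2$, for all $s_1, s_2 \in \mathbb{R}^m$. By duality, it follows that $L_2 \ge \sup_s \|\nabla_s \ell(s, R)\|_2$. We calculate $L_2$ as follows:

Let $b_{ij} = \mathbb{1}(R_i > R_j)(1 + s_j - s_i)$. The sub-gradient of $\phi^{v}_{\mathrm{SLAM}}$ is $\nabla_s \phi^{v}_{\mathrm{SLAM}}(s, R) = \sum_{i=1}^{m} v_i a^i$, where

$$a^i = \begin{cases} 0 \in \mathbb{R}^m & \text{if } \max_{j = 1, \ldots, m} b_{ij} \le 0 \\ e_k - e_i \in \mathbb{R}^m & \text{otherwise, with } k = \operatorname{argmax}_{j = 1, \ldots, m} b_{ij} \end{cases} \tag{13}$$

and $e_i$ is a standard basis vector along coordinate $i$.

Since $\|a^i\|_1 \le 2$, it is easy to see that $\|\nabla_s \phi^{v}_{\mathrm{SLAM}}(s, R)\|_1 \le 2 \sum_{i=1}^{m} v_i$. Since the $\ell_1$ norm dominates the $\ell_2$ norm, $\phi^{v}_{\mathrm{SLAM}}(s, R)$ is Lipschitz continuous in $\ell_2$ norm whenever we can bound $\sum_{i=1}^{m} v_i$. It is easy to check that $\sum_{i=1}^{m} v_i^{\mathrm{AP}} = 1$ and $\sum_{i=1}^{m} v_i^{\mathrm{NDCG}} = 1$. Hence, $v^{\mathrm{NDCG}}$ and $v^{\mathrm{AP}}$ induce Lipschitz continuous surrogates, with Lipschitz constant at most 2.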

Comparison with Surrogates Derived from Structured Prediction Framework: We briefly highlight the difference between SLAM and listwise surrogates obtained from the structured prediction framework (Chapelle et al., 2007; Yue et al., 2007; Chakrabarti et al., 2008). Structured prediction models for ranking assume that the supervision space is the space of full rankings of a document list. Usually a large number of full rankings are compatible with a relevance vector, in which case the relevance vector is arbitrarily mapped to a full ranking. In fact, here is a quote from one of the relevant papers (Chapelle et al., 2007): “It is often the case that this $y_q$ is not unique and we simply take of one of them at random” ($y_q$ refers to a correct full ranking pertaining to query $q$). Thus, all but one correct full ranking will yield a loss. In contrast, in SLAM, documents with the same relevance level are essentially exchangeable (see Eq. (9)). Thus, our assumption that documents are sorted according to relevance during design of weight vectors is without arbitrariness, and there will be no change in the amount of loss when documents within the same relevance class are compared.


5. Perceptron-like Algorithms 


We present a perceptron-like algorithm for learning a ranking function in an online setting, using the SLAM family. Since our proposed perceptron-like algorithm works for both NDCG and AP induced losses, for derivation purposes, we denote a performance measure induced loss as RankingMeasureLoss (RML). Thus, RML can be NDCG induced loss or AP induced loss.

Informal Definition: The algorithm works as follows. At time $t$, the learner maintains a linear ranking function, parameterized by $w_t$. The learner receives $X_t$, which is the document list retrieved for query $q_t$, and ranks it. Then the ground truth relevance vector $R_t$ is received and the ranking function is updated according to the perceptron rule.

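Let $b_{ij} = \mathbb{1}(R_i > R_j)(1 + s^w_j - s^w_i)$. For subsequent ease of derivations, we write the SLAM loss from Eq. (9) as: $\phi^{v}_{\mathrm{SLAM}}(s^w, R) = \sum_{i=1}^{m} v_i c_i$, where

$$c_i = \begin{cases} 0 & \text{if } \max_{j = 1, \ldots, m} b_{ij} \le 0 \\ 1 + s^w_k - s^w_i & \text{otherwise, with } k = \operatorname{argmax}_{j = 1, \ldots, m} b_{ij} \end{cases} \tag{14}$$

and $s^w = Xw \in \mathbb{R}^m$.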

Like the classification perceptron, our perceptron-like algorithm operates on the loss $f_t(w)$, defined on a sequence of data $\{X_t, R_t\}_{t \ge 1}$, produced by an adaptive adversary (i.e., an adversary who can see the learner's move before making its move) as follows:

$$f_t(w) = \begin{cases} \phi^{v_t}_{\mathrm{SLAM}}(s_t^{w}, R_t) & \text{if } \mathrm{RML}(s_t^{w_t}, R_t) \ne 0 \\ 0 & \text{if } \mathrm{RML}(s_t^{w_t}, R_t) = 0 \end{cases} \tag{15}$$

Here, $s_t^{w} = X_t w$ and $v_t = v_t^{\mathrm{NDCG}}$ or $v_t^{\mathrm{AP}}$ depending on whether RML is NDCG or AP induced loss. Since the weight vector $v$ depends on the relevance vector $R$ (Eq. (10), (11)), the subscript $t$ in $v_t$ denotes the dependence on $R_t$. Moreover, $w_t$ is the parameter produced by our perceptron (Algorithm 2) at the end of step $t - 1$, with the adaptive adversary being influenced by the move of the perceptron (recall Eq. (3) and the discussion thereafter).

It is clear from Theorem 4 and Eq. (15) that $f_t(w_t) \ge \mathrm{RML}(s_t^{w_t}, R_t)$. It should also be noted that $f_t(\cdot)$ is convex in either of the two cases. Thus, we can run the online gradient descent (OGD) algorithm (Zinkevich, 2003) to learn the sequence of parameters $w_t$, starting with $w_1 = 0$. The OGD update rule, $w_{t+1} = w_t - \eta z_t$, for some $z_t \in \partial f_t(w_t)$ and step size $\eta$, requires a subgradient $z_t$ that, in our case, is computed as follows. When $\mathrm{RML}(s_t^{w_t}, R_t) = 0$, we have $z_t = 0 \in \mathbb{R}^d$. When $\mathrm{RML}(s_t^{w_t}, R_t) \ne 0$, we have


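$$z_t = X_t^\top \Big( \sum_{i=1}^{m} v_{t,i}\, a_{t,i} \Big), \quad \text{where } a_{t,i} = \begin{cases} 0 \in \mathbb{R}^m & \text{if } c_i = 0 \\ e_k - e_i \in \mathbb{R}^m & \text{if } c_i \ne 0 \end{cases} \tag{16}$$

where $e_k$ is the standard basis vector along coordinate $k$, and $c_i \in \mathbb{R}$ and $k$ are as defined in Eq. (14) (with $s^w = s_t^{w_t} = X_t w_t$).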

We now obtain a perceptron-like algorithm for the learning to rank problem. 


Algorithm 2 Perceptron Algorithm for Learning to Rank

Learning rate $\eta > 0$, $w_1 = 0 \in \mathbb{R}^d$.
For $t = 1$ to $T$
    Receive $X_t$ (document list for query $q_t$).
    Set $s_t^{w_t} = X_t w_t$, predicted ranking output $p_t = \mathrm{argsort}(s_t^{w_t})$.
    Receive $R_t$.
    If $\mathrm{RML}(s_t^{w_t}, R_t) \ne 0$    // Note: $\mathrm{RML}(s_t^{w_t}, R_t) = \mathrm{RML}(\mathrm{argsort}(s_t^{w_t}), R_t)$
        $w_{t+1} = w_t - \eta z_t$    // $z_t$ is defined in Eq. (16)
    else
        $w_{t+1} = w_t$
End For
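The ranking perceptron can be sketched in Python as follows. This is a minimal illustration for the AP induced loss only, not the authors' implementation: the function names and toy stream are hypothetical, ties in scores are broken by document index, and every query is assumed to contain at least one relevant document. An NDCG variant would swap in the NDCG induced loss and the NDCG weight vector.

```python
def ranking(s):
    """Document indices in descending score order (ties broken by index; an assumption)."""
    return sorted(range(len(s)), key=lambda i: (-s[i], i))

def ap_loss(s, R):
    hits, total = 0, 0.0
    for pos, doc in enumerate(ranking(s), start=1):
        if R[doc] == 1:
            hits += 1
            total += hits / pos
    return 1.0 - total / sum(R)

def subgradient(X, s, R, v):
    """z = X^T (sum_i v_i a_i), with a_i = e_k - e_i when document i has a positive slack."""
    m, d = len(X), len(X[0])
    g = [0.0] * m                                # gradient with respect to the scores
    for i in range(m):
        best, k = 0.0, None
        for j in range(m):
            b = 1.0 + s[j] - s[i]
            if R[i] > R[j] and b > best:
                best, k = b, j
        if k is not None:
            g[k] += v[i]
            g[i] -= v[i]
    return [sum(X[i][c] * g[i] for i in range(m)) for c in range(d)]

def rank_perceptron(stream, d, eta=1.0):
    w = [0.0] * d
    cumulative = 0.0
    for X, R in stream:
        s = [sum(x * wi for x, wi in zip(row, w)) for row in X]
        loss = ap_loss(s, R)
        cumulative += loss
        if loss > 0:                             # update only on imperfect rankings
            r = sum(R)
            v = [1.0 / r if x == 1 else 0.0 for x in R]
            z = subgradient(X, s, R, v)
            w = [wi - eta * zi for wi, zi in zip(w, z)]
    return w, cumulative

# Hypothetical stream: relevance is driven by the first feature.
q1 = ([[0.0, 1.0], [1.0, 0.0]], [0, 1])
q2 = ([[-1.0, 0.5], [0.5, 0.2], [1.0, -0.3]], [0, 1, 1])
w_final, total_loss = rank_perceptron([q1, q2] * 3, d=2)
print(w_final, total_loss)
```

On this separable toy stream the algorithm ranks imperfectly only in the first round, so the cumulative AP induced loss stops growing afterwards.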


5.1 Bound on Cumulative Loss 

We provide a theoretical bound on the cumulative loss (as measured by RML) of perceptron for the learning to rank problem. The technique uses regret analysis of online convex optimization algorithms. We state the standard OGD bound used to get our main theorem (Zinkevich, 2003). An important thing to remember is that the OGD guarantee holds for convex functions played by an adaptive adversary, which is important for an OGD based analysis of the perceptron algorithm.
Proposition (OGD regret). Let $f_t$ be a sequence of convex functions. The update rule of the function parameter is $w_{t+1} = w_t - \eta z_t$, where $z_t \in \partial f_t(w_t)$. Then for any $w \in \mathbb{R}^d$, the following regret bound holds after $T$ rounds:

$$\sum_{t=1}^{T} f_t(w_t) - \sum_{t=1}^{T} f_t(w) \le \frac{\|w\|_2^2}{2\eta} + \frac{\eta}{2} \sum_{t=1}^{T} \|z_t\|_2^2. \tag{17}$$


We first control the norm of the subgradient $z_t$, defined in Eq. (16). To do this, we will need to use the $p \to q$ norm of a matrix.

Definition ($p \to q$ norm). Let $A$ be a real matrix. The $p \to q$ norm of $A$ is:

$$\|A\|_{p \to q} = \max_{v \ne 0} \frac{\|Av\|_q}{\|v\|_p}.$$


Lemma 5. Let $R_x$ be the bound on the maximum $\ell_2$ norm of the feature vectors representing the documents. Let $v_{t,\max} = \max_{i,j} \left\{ \frac{v_{t,i}}{v_{t,j}} \right\}$, $\forall\, i, j$ with $v_{t,i} > 0$, $v_{t,j} > 0$, and let $m$ be the bound on the number of documents per query. Then we have the following $\ell_2$ norm bound:

$$\forall\, t, \quad \|z_t\|_2^2 \le 4\, m R_x^2\, v_{t,\max}\, f_t(w_t). \tag{18}$$


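Proof. For a mistake round $t$, we have $z_t = X_t^\top \left( \sum_{i=1}^{m} v_{t,i}\, a_{t,i} \right)$ from Eq. (16).

1st bound for $z_t$:

$$\Big\| X_t^\top \Big( \sum_{i=1}^{m} v_{t,i}\, a_{t,i} \Big) \Big\|_2 \le \|X_t^\top\|_{1 \to 2} \Big\| \sum_{i=1}^{m} v_{t,i}\, a_{t,i} \Big\|_1 \le 2 R_x \sum_{i=1}^{m} v_{t,i} = 2 R_x.$$

The first inequality uses the $1 \to 2$ norm, the second uses $\|X_t^\top\|_{1 \to 2} \le R_x$ and $\|a_{t,i}\|_1 \le 2$, and the last equality holds because $\sum_{i=1}^{m} v_i^{\mathrm{AP}} = 1$ and $\sum_{i=1}^{m} v_i^{\mathrm{NDCG}} = 1$.

2nd bound for $z_t$ (the self-bounding property of SLAM is being used here, to bound the norm of the gradient by the loss itself):

We note that in a mistake round, $\mathrm{RML}(s_t^{w_t}, R_t) \ne 0$. Thus, there is at least one pair of documents whose ranks are inconsistent with their relevance levels. Mathematically,

$$\exists\, i', k' \ \text{ s.t. } \ R_{t,i'} > R_{t,k'}, \quad s^{w_t}_{t,i'} < s^{w_t}_{t,k'}.$$

Now, $\phi^{v_t}_{\mathrm{SLAM}}(s_t^{w_t}, R_t) = \sum_{i=1}^{m} v_{t,i}\, c_i$ (Eq. (14)). For $(i', k')$, we have $c_{i'} \ge 1 + s^{w_t}_{t,k'} - s^{w_t}_{t,i'} > 1$.

Since $R_{t,i'} > R_{t,k'}$, document $i'$ has strictly greater than the minimum possible relevance, i.e., $R_{t,i'} > 0$. By our calculations of weight vector $v$ for both NDCG and AP, we have $v_{t,i'} > 0$.

Thus, by definition, $v_{t,\max} \ge 1$ (since $v_{t,i'} > 0$, $\frac{v_{t,i'}}{v_{t,i'}} = 1$, and $v_{t,\max} = \max_{i,j} \left\{ \frac{v_{t,i}}{v_{t,j}} \right\}$, $\forall\, i, j$ with $v_{t,i} > 0$, $v_{t,j} > 0$).

Then, $\forall\, i$, $v_{t,i} \le v_{t,\max} \cdot v_{t,i'} \le v_{t,\max} \cdot v_{t,i'} \cdot c_{i'}$. Thus, we have:

$$\sum_{i=1}^{m} v_{t,i} \le m\, v_{t,\max}\, v_{t,i'}\, c_{i'} \le m\, v_{t,\max} \Big( \sum_{i=1}^{m} v_{t,i}\, c_i \Big) = m\, v_{t,\max}\, \phi^{v_t}_{\mathrm{SLAM}}(s_t^{w_t}, R_t).$$

It follows that $\|z_t\|_2 \le 2 R_x \sum_{i} v_{t,i} \le 2 R_x\, m\, v_{t,\max}\, \phi^{v_t}_{\mathrm{SLAM}}(s_t^{w_t}, R_t)$.

Combining the 1st and 2nd bound for $z_t$, we get $\|z_t\|_2^2 \le 4 R_x^2\, m\, v_{t,\max}\, \phi^{v_t}_{\mathrm{SLAM}}(s_t^{w_t}, R_t)$, for mistake rounds.

Since, for non-mistake rounds, we have $z_t = 0$ and $f_t(w_t) = 0$, we get the final inequality. □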

Taking $\max_{t = 1, \ldots, T} v_{t,\max} \le v_{\max}$, we have the following theorem, which uses the norm bound on $z_t$.

Theorem 6. Suppose Algorithm 2 receives a sequence of instances $(X_1, R_1), \ldots, (X_T, R_T)$ and let $R_x$ be the bound on the maximum $\ell_2$ norm of the feature vectors representing the documents. Then the following inequality holds, after optimizing over learning rate $\eta$, $\forall\, w \in \mathbb{R}^d$:

$$\sum_{t=1}^{T} \mathrm{RML}(s_t^{w_t}, R_t) \le \sum_{t=1}^{T} f_t(w) + 2 \|w\|_2 R_x \sqrt{m\, v_{\max}} \sqrt{\sum_{t=1}^{T} f_t(w)} + 4 \|w\|_2^2\, m R_x^2\, v_{\max}. \tag{19}$$

In the special case where there exists $w$ s.t. $f_t(w) = 0$, $\forall\, t$, we have

$$\sum_{t=1}^{T} \mathrm{RML}(s_t^{w_t}, R_t) \le 4 \|w\|_2^2\, m R_x^2\, v_{\max}. \tag{20}$$

Proof. The proof follows by plugging in the expression for $\|z_t\|_2^2$ (Lemma 5) in the OGD equation (Prop. OGD Regret), optimizing over $\eta$, using the algebraic trick $x - b\sqrt{x} - c \le 0 \implies x \le b^2 + c + b\sqrt{c}$, and then using the inequality $f_t(w_t) \ge \mathrm{RML}(s_t^{w_t}, R_t)$. □

Note: The perceptron bound in Eq. (19) is a loss bound, i.e., the left hand side is the cumulative NDCG/AP induced loss while the right side is a function of the cumulative surrogate loss. We discuss the significance of this bound in detail later.
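The proof steps can be written out as a short derivation. This is a sketch under the assumption that the OGD regret bound takes the standard form $\frac{\|w\|_2^2}{2\eta} + \frac{\eta}{2}\sum_t \|z_t\|_2^2$ with $w_1 = 0$; the shorthands $S$, $C$, $B$ and $b$ are introduced here only for this illustration.

```latex
% Shorthands (hypothetical, for this sketch only):
%   S = \sum_t f_t(w_t),  C = \sum_t f_t(w),  B = 4 m R_x^2 v_{\max}.
\begin{align*}
S - C &\le \frac{\|w\|_2^2}{2\eta} + \frac{\eta}{2}\sum_{t=1}^{T}\|z_t\|_2^2
       \le \frac{\|w\|_2^2}{2\eta} + \frac{\eta}{2}\, B\, S
       && \text{(OGD regret, then Lemma 5 with } v_{t,\max} \le v_{\max}\text{)} \\
S - C &\le \|w\|_2 \sqrt{B\, S}
       && \text{(minimizing over } \eta \text{, attained at } \eta = \|w\|_2 / \sqrt{B\, S}\text{)} \\
S &\le b^2 + C + b\sqrt{C}, \qquad b = \|w\|_2 \sqrt{B} = 2 \|w\|_2 R_x \sqrt{m\, v_{\max}}
       && \text{(algebraic trick with } x = S,\ c = C\text{)} \\
\sum_{t=1}^{T} \mathrm{RML}(s_t^{w_t}, R_t) &\le S
       && \text{(since } f_t(w_t) \ge \mathrm{RML}(s_t^{w_t}, R_t)\text{)}
\end{align*}
```

Chaining the last two lines gives a bound with leading constant $b^2 = 4\|w\|_2^2\, m R_x^2\, v_{\max}$, which is the only term that survives when $C = 0$.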

Like perceptron for binary classification, the constant in Eq. (20) needs to be expressed in terms of a “margin”. A natural definition of margin in case of ranking data is as follows: let us assume that there is a linear scoring function parameterized by a unit norm vector $w^*$, such that all documents for all queries are ranked not only correctly, but correctly with a margin $\gamma$:

$$\min_{t = 1, \ldots, T}\ \min_{i, j:\, R_{t,i} > R_{t,j}} \left( {w^*}^\top X_{t,i} - {w^*}^\top X_{t,j} \right) \ge \gamma, \tag{21}$$

where $X_{t,i}$ denotes the feature vector of the $i$th document in $X_t$.

Corollary 7. If the margin condition (21) holds, then the total loss, for both NDCG and AP induced loss, is upper bounded by $\frac{4\, m R_x^2\, v_{\max}}{\gamma^2}$, a bound independent of the number of instances in the online sequence.
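Proof. Fix a $t$ and the example $(X_t, R_t)$. Set $w = w^* / \gamma$. For this $w$, we have

$$\min_{i, j:\, R_{t,i} > R_{t,j}} \left( w^\top X_{t,i} - w^\top X_{t,j} \right) \ge 1,$$

which means that

$$\min_{i, j:\, R_{t,i} > R_{t,j}} \left( s^{w}_{t,i} - s^{w}_{t,j} \right) \ge 1.$$

This immediately implies that $\mathbb{1}(R_{t,i} > R_{t,j})(1 + s^{w}_{t,j} - s^{w}_{t,i}) \le 0$, $\forall\, i, j$. Therefore, $\phi^{v_t}_{\mathrm{SLAM}}(s_t^{w}, R_t) = 0$ and hence $f_t(w) = 0$. Since this holds for all $t$, we have $\sum_{t=1}^{T} f_t(w) = 0$. The bound then follows from Eq. (20) with $\|w\|_2 = 1 / \gamma$. □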


5.1.1 Perceptron Bound-General Discussion 


We remind once again that $\mathrm{RML}(s_t^{w_t}, R_t)$ is either $1 - \mathrm{NDCG}(s_t^{w_t}, R_t)$ or $1 - \mathrm{AP}(s_t^{w_t}, R_t)$, depending on the measure of interest.


Importance of learning rate parameter $\eta$: Like the classification perceptron, Algorithm 2 also has the learning rate parameter $\eta$ embedded, and the optimal upper bound on loss is obtained by optimizing over $\eta$. However, unlike the classification perceptron, the performance is not independent of $\eta$. The prediction at each round is the ranking obtained from the sorted order of scores, i.e., $p_t = \mathrm{argsort}(X_t w_t)$. Let $\mathcal{M}_t$ indicate the rounds, up to time point $t - 1$, where the algorithm did not produce a perfect ranking. Starting from $w_1 = 0$ and unraveling $w_t$, we get $p_t = \mathrm{argsort}(\sum_{i \in \mathcal{M}_t} -\eta X_t \cdot z_i)$. Now, had $z_i$ been independent of $\eta$, then $p_t$, which is the sorted order of the score vector, would have been independent of the scaling factor $\eta > 0$. However, each $z_i$ depends on $w_i$ implicitly (Eq. 16), and the $w_i$ themselves depend on $\eta$ (recall that for the classification perceptron, $z_i = -y_i x_i$, i.e., independent of $w_i$ during mistake round $i$). To clarify, we consider, during a mistake round, two score vectors $s^1$ and $s^2$, where $s^1 = \eta s^2$. Had the subgradient $z$, during a mistake round, indeed been independent of $w$ (and hence of the score $s = X \cdot w$), then $z$ would have been the same for both $s^1$ and $s^2$.

However, this is not the case. To see this, note that $c_i$ (Eq. 14), for some $i$, can be 0 for $s^1$ but non-zero for $s^2$, depending on the value of $\eta$, which affects the gradient.
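
The scale dependence described above can be illustrated with a small numerical sketch. It is only an illustration under simplifying assumptions: a generic pairwise hinge term $1 + s_j - s_i$ over pairs with $R_i > R_j$ stands in for the $c_i$ terms of Eq. 14, and the helper names `ranking` and `active_pairs`, as well as the example scores and the factor `eta`, are hypothetical. Rescaling the scores leaves the predicted ranking unchanged, but it changes which hinge terms are active, and therefore changes the subgradient.

```python
import numpy as np

def ranking(scores):
    """Document indices sorted by descending score (ties broken by index)."""
    return np.argsort(-scores, kind="stable")

def active_pairs(scores, rel):
    """Pairs (i, j) with rel[i] > rel[j] whose hinge term 1 + s_j - s_i is positive."""
    m = len(scores)
    return {
        (i, j)
        for i in range(m)
        for j in range(m)
        if rel[i] > rel[j] and 1.0 + scores[j] - scores[i] > 0.0
    }

rel = np.array([1, 0, 0])        # document 0 is the only relevant one
s2 = np.array([0.5, 2.0, -1.0])  # document 1 outscores document 0: a mistake round
eta = 0.1                        # hypothetical scaling factor
s1 = eta * s2                    # same direction, different scale

print(ranking(s1), ranking(s2))  # the two rankings coincide
print(active_pairs(s2, rel))     # only the pair (0, 1) is active
print(active_pairs(s1, rel))     # after scaling, the pair (0, 2) becomes active too
```

For a surrogate whose subgradient depends only on the maximizing pair, such as the one in Eq. 27, the same rescaling would leave the update direction unchanged.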

Dependence of perceptron bound on number of documents per query: The perceptron bound in Eq. 19 is meaningful only if $v_{\max}$ is a finite quantity.

For AP, it can be seen from the definition of $v^{AP}$ in Eq. 10 that $v_{\max} = 1$. Thus, for AP induced loss, the constant in the perceptron bound is $\frac{4 m R_X^2}{\gamma^2}$.

For NDCG, $v_{\max}$ depends on the maximum relevance level. Assuming the maximum relevance level is finite (in practice, the maximum relevance level is usually below 5), $v_{\max} = O(\log(m))$. Thus, for NDCG induced loss, the constant in the perceptron bound is $\frac{4 m \log(m) R_X^2}{\gamma^2}$.


Significance of perceptron bound: The main perceptron bound is given in Eq. 19, with the special case being captured in Corollary 7. At first glance, the bound might seem non-informative because the left side is the cumulative NDCG/AP induced loss, while the right side is a function of the cumulative surrogate loss.

The first thing to note is that the perceptron bound is derived from the well-known regret bound of the OGD algorithm applied to an arbitrary convex, Lipschitz surrogate. So, even ignoring the perceptron bound in Eq. 19, the perceptron algorithm is a valid online algorithm, applied to the sequence of convex functions $f_t$ to learn the ranking function $w_t$, with a meaningful regret bound. Second, as we had mentioned in the introduction, our perceptron bound is the extension of the perceptron bound in classification to the cumulative NDCG/AP induced losses in the learning to rank setting. This can be observed by noticing the similarity between Eq. 19 and the classification perceptron bound. In both cases, the cumulative target loss on the left is bounded by a function of the cumulative surrogate loss on the right, where the surrogate is the hinge (and hinge-like SLAM) loss.

The interesting aspects of the perceptron loss bound become apparent on close investigation of the cumulative surrogate loss term $\sum_{t=1}^T f_t(w)$ and comparison with the regret bound. It is well known that when OGD is run on any convex, Lipschitz surrogate, the guarantee on the regret scales at the rate $O(\sqrt{T})$. So, if we only ran OGD on an arbitrary convex, Lipschitz surrogate, then, even with the assumption of existence of a perfect ranker, the upper bound on the cumulative loss would have scaled as $O(\sqrt{T})$. However, in the perceptron loss bound, if $\sum_{t=1}^T f_t(w) = o(T^{\alpha})$, then the upper bound on the cumulative loss would scale as $O(T^{\alpha})$, which can be much better than $O(\sqrt{T})$ for $\alpha < 1/2$. In the best case of $\sum_{t=1}^T f_t(w) = 0$, the total cumulative loss would be bounded, irrespective of the number of instances.

Comparison and contrast with perceptron for classification: The perceptron for learning to rank is an extension of the perceptron for classification, both in terms of the algorithm and the loss bound. To obtain the perceptron loss bounds in the learning to rank setting, we had to address multiple non-trivial issues, which do not arise in the classification setting. Unlike in classification, the NDCG/AP losses are not $\{0, 1\}$-valued. The analysis is trivial in the classification perceptron since, on a mistake round, the absolute value of the gradient of the hinge loss is 1, which is the same as the loss itself. In our setting, the lemma that bounds the gradient of the SLAM surrogate is crucial: there we exploit the structure of the SLAM surrogate to bound the square of the gradient by the surrogate loss.


5.1.2 Perceptron Bound Dependent On NDCG Cut-Off Point

The bound on the cumulative loss in Eq. (19) is dependent on $m$, the maximum number of documents per query. It is often the case in learning to rank that though a list has $m$ documents, the focus is on the top $k$ documents ($k \ll m$) in the order sorted by score. The measure used for the top-$k$ documents is $NDCG_k$ (there does not exist an equivalent definition for AP).

We consider a modified set of weights $v^{NDCG_k}$ such that $\phi^{v^{NDCG_k}}_{SLAM}(s, R) \ge 1 - NDCG_k(s, R)$ holds $\forall s$, for every $R$. We provide the definition of $v^{NDCG_k}$ later in the proof of Theorem 8.

Overloading notation with $v_t = v_t^{NDCG_k}$, let $v_{t,\max} = \max_{i,j} \left\{ \frac{v_{t,i}}{v_{t,j}} \right\}$ with $v_{t,i} > 0$, $v_{t,j} > 0$ and $v_{\max} = \max_{1 \le t \le T} v_{t,\max}$.

Theorem 8. Suppose the perceptron algorithm receives a sequence of instances $(X_1, R_1), \ldots, (X_T, R_T)$. Let $k$ be the cut-off point of NDCG. Also, for any $w \in \mathbb{R}^d$, let $f_t(w)$ be as defined in Eq. (15), but with $\phi^{v_t}_{SLAM}(s^w_t, R_t) = \phi^{v^{NDCG_k}}_{SLAM}(s^w_t, R_t)$. Then, the following inequality holds, after optimizing over learning rate $\eta$:

$$\sum_{t=1}^T \left(1 - NDCG_k(s_t^{w_t}, R_t)\right) \le \sum_{t=1}^T f_t(w) + \sqrt{4\|w\|_2^2\, k R_X^2 v_{\max} \sum_{t=1}^T f_t(w)} + 4\|w\|_2^2\, k R_X^2 v_{\max}. \quad (22)$$

In the special case where there exists $w$ s.t. $f_t(w) = 0$, $\forall t$, we have

$$\sum_{t=1}^T \left(1 - NDCG_k(s_t^{w_t}, R_t)\right) \le 4\|w\|_2^2\, k R_X^2 v_{\max}. \quad (23)$$

Discussion: Assuming the maximum relevance level is finite, we have $v_{\max} = O(\log(k))$ (using the definition of $v^{NDCG_k}$). Thus, the constant term in the perceptron bound for $NDCG_k$ induced loss is $4\|w\|_2^2\, k \log(k) R_X^2$. This is a significant improvement over the original error term, even though the perceptron algorithm is running on queries with $m$ documents, which can be very large. A margin dependent bound can be defined in the same way as before.


Proof. We remind the reader again that ranking performance measures only depend on the permutation of documents and individual relevance levels. They do not depend on the identity of the documents. Documents with the same relevance level can be considered interchangeable, i.e., relevance levels create equivalence classes. Thus, w.l.o.g., we assume that $R_1 \ge R_2 \ge \ldots \ge R_m$ and that documents with the same relevance level are sorted according to score. Also, $\pi^{-1}(i)$ means the position of document $i$ in permutation $\pi$.

We define

$$v_i^{NDCG_k} = \begin{cases} \frac{G(R_i) D(i)}{Z_k(R)} & \text{if } i = 1, 2, \ldots, k \\ 0 & \text{if } i = k + 1, \ldots, m. \end{cases} \quad (24)$$

We now prove the upper bound property that $\phi^{v^{NDCG_k}}_{SLAM}(s, R) \ge 1 - NDCG_k(s, R)$ holds $\forall s$, for every $R$. We have the following equations:

$$\frac{\sum_{i=1}^m G(R_i) D(i)\, \mathbb{1}(i \le k)}{Z_k(R)} = 1 \quad \text{and} \quad NDCG_k(s, R) = \frac{\sum_{i=1}^m G(R_i) D(\pi_s^{-1}(i))\, \mathbb{1}(\pi_s^{-1}(i) \le k)}{Z_k(R)}.$$

$$1 - NDCG_k(s, R) = \frac{\sum_{i=1}^m G(R_i) \left( D(i)\, \mathbb{1}(i \le k) - D(\pi_s^{-1}(i))\, \mathbb{1}(\pi_s^{-1}(i) \le k) \right)}{Z_k(R)}.$$

For $i > k$: $\mathbb{1}(i \le k) = 0$ and since $D(\pi_s^{-1}(i))$ is non-negative, every term in $1 - NDCG_k(s, R)$ is non-positive for $i > k$.

For $i \le k$, there are four possible cases:

1. $i \ge \pi_s^{-1}(i)$ and $\pi_s^{-1}(i) > k$. This is infeasible since $i \le k$.

2. $i \ge \pi_s^{-1}(i)$ and $\pi_s^{-1}(i) \le k$. In this case, the numerator in $1 - NDCG_k$ is $G(R_i)(D(i) - D(\pi_s^{-1}(i)))$. Now, since $D(\cdot)$ is a decreasing function, the contribution of document $i$ to the NDCG induced loss is non-positive and can be ignored (since SLAM by definition is a sum of positive weighted indicator functions).

3. $i < \pi_s^{-1}(i)$ and $\pi_s^{-1}(i) > k$. In this case, the numerator in $1 - NDCG_k$ is $G(R_i) D(i)$. Since $i < \pi_s^{-1}(i)$, document $i$ was outscored by a document $j$, where $i < j$ (otherwise, document $i$ would have been put by $\pi_s$ in a position the same as or above its current one, i.e., $i \ge \pi_s^{-1}(i)$). Moreover, $R_i > R_j$ (because of the assumption that within the same relevance class, scores are sorted). Hence the indicator of SLAM at $i$ would have come on and $v_i^{NDCG_k} = \frac{G(R_i) D(i)}{Z_k(R)}$.

4. $i < \pi_s^{-1}(i)$ and $\pi_s^{-1}(i) \le k$. In this case, the numerator in $1 - NDCG_k$ is $G(R_i)(D(i) - D(\pi_s^{-1}(i)))$. By the same reasoning as in case 3, the indicator of SLAM at $i$ would have come on and $v_i^{NDCG_k} = \frac{G(R_i) D(i)}{Z_k(R)} \ge \frac{G(R_i)(D(i) - D(\pi_s^{-1}(i)))}{Z_k(R)}$, by the definition of $v^{NDCG_k}$ and the fact that $D(i) \ge D(i) - D(\pi_s^{-1}(i))$.

Hence, the upper bound property holds.

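The proof of Theorem 8 now follows directly from the argument in the proof of the lemma bounding the gradient of the SLAM surrogate, by noting a few things:

a) $\sum_{i=1}^m v_i^{NDCG_k} = 1$. b) $\phi^{v^{NDCG_k}}_{SLAM}$ has the same structure as $\phi^{v^{NDCG}}_{SLAM}$, but with different weights. Hence the structure of $z_t$ remains the same, but with the weights of $v^{NDCG_k}$.

Hence, the 1st bound on the gradient $z_t$ in the proof of the lemma remains the same. For the 2nd bound on the gradient $z_t$, the crucial thing that changes is that $\sum_{i=1}^m v_{t,i}\, \mathbb{1}(c_{t,i} > 0) \le k\, v_{t,\max}\, v_{t,i'}\, c_{t,i'}$, with the new definition of $v_{t,\max}$ according to $v^{NDCG_k}$. This implies $2R_X \sum_{i=1}^m v_{t,i}\, \mathbb{1}(c_{t,i} > 0) \le 2R_X\, k\, v_{t,\max}\, \phi^{v_t}_{SLAM}(s_t^{w_t}, R_t)$.

□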


6. Minimax Bound on Cumulative NDCG/AP Induced Loss 

We discuss a lower bound achievable on separable datasets and another perceptron-like algorithm achieving the bound.


6.1 Lower Bound

The following theorem gives a lower bound on the cumulative NDCG/AP induced loss, achievable by any deterministic online algorithm.


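Theorem 9. Suppose the number of documents per query is $m \ge 2$ and relevance vectors are restricted to being binary graded. Let $\mathcal{X} = \{X \in \mathbb{R}^{m \times d} \mid \|X_{j:}\|_2 \le R_X\}$ and $\frac{R_X^2}{\gamma^2} \le d$. Then, for any deterministic online algorithm, there exists a ranking dataset which is separable by margin $\gamma$ (Eq. 21), on which the algorithm suffers $\Omega\left(\left\lfloor \frac{R_X^2}{\gamma^2} \right\rfloor\right)$ cumulative NDCG/AP induced loss.


Proof. Let $T = \left\lfloor \frac{R_X^2}{\gamma^2} \right\rfloor - 1$. Since $\frac{R_X^2}{\gamma^2} \le d$, we have $T + 1 \le d$ and $(T + 1)\gamma^2 \le R_X^2$. Let a ranking dataset consist of the following $T$ document matrices, for $1 \le i \le T$:

$$X_i = \begin{pmatrix} R_X \cdot e_{i+1}^\top \\ -R_X \cdot e_{i+1}^\top \\ R_X \cdot e_1^\top \\ \vdots \\ R_X \cdot e_1^\top \end{pmatrix} \in \mathcal{X}, \quad (25)$$

where $e_i$ is the unit vector of length $d$ with 1 in the $i$th coordinate and 0 in others. These document matrices are presented to a deterministic algorithm $A$ in order.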

The relevance vectors for the dataset are set as follows: for matrix $X_i$, if $A$ puts the 1st document at position 1 then $R_{i,1} = 0$, $R_{i,2} = 1$. Otherwise, $R_{i,1} = 1$, $R_{i,2} = 0$. In either case, $R_{i,j} = 0$ for $j > 2$. With this choice, note that $R_{i,1} > R_{i,2}$ iff $R_i = (1, 0, 0, \ldots, 0)^\top$ and $R_{i,1} < R_{i,2}$ iff $R_i = (0, 1, 0, \ldots, 0)^\top$.

We have to make sure that, irrespective of what $A$ does, we can always find a unit norm weight vector such that the dataset is actually separable with margin $\gamma$. Let a ranking function parameter $w^* \in \mathbb{R}^d$ be defined as follows: $w^*_1 = -\frac{\gamma}{2R_X}$,

$$w^*_i = \begin{cases} \frac{\gamma}{2R_X} & \text{if } R_{i-1,1} > R_{i-1,2} \\ -\frac{\gamma}{2R_X} & \text{otherwise} \end{cases} \quad \text{for } 2 \le i \le T + 1.$$

For $T + 1 < i \le d$, set $w^*_i = 0$. The unit norm condition holds because $\|w^*\|_2^2 = \frac{(T+1)\gamma^2}{4R_X^2} \le 1$. The margin condition holds as follows. Fix $i \in [T]$. If $R_{i,1} > R_{i,2}$, then

$$X_i w^* = (\gamma/2, -\gamma/2, -\gamma/2, \ldots, -\gamma/2)^\top.$$

Otherwise, if $R_{i,1} < R_{i,2}$, then

$$X_i w^* = (-\gamma/2, +\gamma/2, -\gamma/2, \ldots, -\gamma/2)^\top.$$

Therefore, in either case, $w^*$ scores the only relevant document above all irrelevant documents by a margin of exactly $\gamma$.

It is clear that, for the above dataset, $A$ will make a ranking mistake in each round. But we need to argue a bit more: we need to show that the NDCG/AP induced loss will be $\Omega(1)$ on each round. Note that a mistake by itself does not guarantee a constant loss, since the minimum possible non-zero loss for these loss functions is dependent on $m$.

We have two cases to consider. First, when $A$ puts document 1 at the top. Note that, in this case, $R_i = (0, 1, 0, \ldots, 0)^\top$. The least loss $A$ incurs in such a scenario is when it puts document 2 in position 2. Therefore, AP is at most $1/2$ and NDCG is at most $1/\log_2 3$, which means that $1 - AP$ and $1 - NDCG$ are both $\Omega(1)$. In the second case, $A$ does not put document 1 at the top. In this case, $R_i = (1, 0, 0, \ldots, 0)^\top$, which means that an irrelevant document gets placed at the top. The least loss $A$ incurs in this scenario is when it puts document 1 in position 2. The AP/NDCG induced losses therefore again have the same minimum values in this case as in the previous one. Since the loss incurred in either of the two types of mistakes is $\Omega(1)$, we conclude that the cumulative NDCG/AP induced loss will be $\Omega(T) = \Omega\left(\left\lfloor \frac{R_X^2}{\gamma^2} \right\rfloor\right)$. □
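
The adversarial construction in the proof can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' code: the function name `build_lower_bound_dataset` and the callback `rank_fn` are hypothetical, the ranker used in the example is a fixed, stateless linear scorer (a genuine online algorithm would carry state, but the adversary only needs to see its prediction before choosing relevances), and coordinates are 0-based. The first coordinate of `w_star` is taken to be $-\gamma/(2R_X)$ so that the filler rows $R_X \cdot e_1^\top$ score $-\gamma/2$, as in the margin computation above.

```python
import numpy as np

def build_lower_bound_dataset(rank_fn, m, d, R_X, gamma):
    """Play the adversary of the lower bound proof against a deterministic ranker."""
    T = int(np.floor(R_X**2 / gamma**2)) - 1
    assert T + 1 <= d, "the construction needs R_X^2 / gamma^2 <= d"
    w_star = np.zeros(d)
    w_star[0] = -gamma / (2 * R_X)           # filler documents score -gamma/2
    data = []
    for r in range(T):                       # r is the 0-based round index
        X = np.zeros((m, d))
        X[0, r + 1] = R_X                    # document 1:  R_X * e_{i+1}
        X[1, r + 1] = -R_X                   # document 2: -R_X * e_{i+1}
        X[2:, 0] = R_X                       # remaining documents: R_X * e_1
        top = rank_fn(X)[0]                  # document the ranker puts first
        rel = np.zeros(m, dtype=int)
        rel[1 if top == 0 else 0] = 1        # make the top-ranked document irrelevant
        sign = 1.0 if rel[0] == 1 else -1.0
        w_star[r + 1] = sign * gamma / (2 * R_X)
        data.append((X, rel, top))
    return data, w_star

# Hypothetical deterministic ranker: a fixed linear scorer with stable tie-breaking.
w0 = np.ones(16)
fixed_ranker = lambda X: np.argsort(-(X @ w0), kind="stable")

data, w_star = build_lower_bound_dataset(fixed_ranker, m=4, d=16, R_X=1.0, gamma=0.25)

# Margin of w_star on every round: relevant score minus best irrelevant score.
margins = [
    (X @ w_star)[rel == 1].min() - (X @ w_star)[rel == 0].max()
    for X, rel, _ in data
]
print(len(data), min(margins), np.linalg.norm(w_star))
```

In every round the top-ranked document is irrelevant, while `w_star` still separates the data with margin `gamma`, which is the situation the proof exploits.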


6.2 Algorithm Achieving Lower Bound 

We will show that the lower bound established in the previous section is actually the minimax bound, achievable by another perceptron type algorithm. Thus, Algorithm 2 is sub-optimal in terms of the bound achieved, since it has a dependence on the number of documents per query.

Our algorithm is inspired by the work of Crammer and Singer [2002]. Following their work, we define a new surrogate via a constrained optimization problem for ranking as follows:

$$\phi_C(s, R) = \min \delta \quad \text{s.t. } \delta \ge 0, \; s_i + \delta \ge \Delta + s_j, \text{ if } R_i > R_j, \; \forall\, i, j. \quad (26)$$

The above constrained optimization problem can be recast as a hinge-like convex surrogate:

$$\phi_C(s, R) = \max_{i \in [m]} \max_{j \in [m]} \mathbb{1}[R(i) > R(j)]\, (1 + s_j - s_i)_+. \quad (27)$$

The key difference between the above surrogate and the previously proposed SLAM family of surrogates is that the above surrogate does not adapt to different ranking measures. It also does not exhibit the listwise property, since it treats an incorrectly ranked pair in a uniform way, independent of where the pair is placed by the ranking induced by $s$.
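
As a rough illustration of Eq. 27 and of the missing listwise property, the surrogate can be evaluated directly with NumPy. The function name `phi_c` and the example score vectors are assumptions made for this sketch only.

```python
import numpy as np

def phi_c(s, R):
    """Surrogate of Eq. 27: max over pairs with R[i] > R[j] of (1 + s[j] - s[i])_+."""
    s = np.asarray(s, dtype=float)
    R = np.asarray(R)
    hinge = np.maximum(0.0, 1.0 + s[None, :] - s[:, None])  # hinge[i, j] = (1 + s_j - s_i)_+
    mask = R[:, None] > R[None, :]                          # mask[i, j] = 1[R_i > R_j]
    return float((hinge * mask).max())

R = [2, 1, 0]                        # relevance levels, most relevant document first

top_error = phi_c([2.0, 3.0, 0.0], R)     # documents 1 and 2 swapped at the top of the list
bottom_error = phi_c([5.0, 0.0, 1.0], R)  # documents 2 and 3 swapped at the bottom
separated = phi_c([3.0, 1.5, 0.0], R)     # every pair ordered correctly with margin >= 1

print(top_error, bottom_error, separated)
```

Both swapped lists receive the same surrogate value even though an error at the top of the list costs more under NDCG, which is the uniform treatment of misranked pairs noted above.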

Similar to Algorithm 2, we define a sequence of losses $f_t(w)$, defined on a sequence of data $(X_t, R_t)$, as follows:

$$f_t(w) = \begin{cases} \phi_C(s_t^w, R_t) & \text{if } RML(s_t^{w_t}, R_t) \neq 0 \\ 0 & \text{if } RML(s_t^{w_t}, R_t) = 0. \end{cases} \quad (28)$$

Here, $s_t^w = X_t w$ and $w_t$ is the parameter produced by Algorithm 3 at time $t$, with the adaptive adversary being influenced by the move of the perceptron. Note that $f_t(w_t) \ge RML(s_t^{w_t}, R_t)$, since $f_t(w)$ is always non-negative and, if $RML(s_t^{w_t}, R_t) > 0$, there is at least one pair of documents whose scores do not agree with their relevances. At that point, the surrogate value is at least 1.

During a mistake round, the gradient $z$ is calculated as follows: let $i^*, j^*$ be any pair of indices that achieve the max in Eq. 27. Then,

$$z = \nabla_w \phi_C(s^w, R) = X^\top \left\{ (-e_{i^*} + e_{j^*})\, \mathbb{1}[R(i^*) > R(j^*)]\, \mathbb{1}[1 + s_{j^*} - s_{i^*} > 0] \right\}. \quad (29)$$

Note that if there are multiple index pairs achieving the max, then an arbitrary subgradient can be written as a convex combination of subgradients computed using each of the pairs.

Algorithm 3 New Perceptron Algorithm Achieving Lower Bound
Learning rate $\eta > 0$, $w_1 = 0 \in \mathbb{R}^d$.
For $t = 1$ to $T$
    Receive $X_t$ (document list for query $q_t$).
    Set $s_t^{w_t} = X_t w_t$, predicted ranking output $p_t = \mathrm{argsort}(s_t^{w_t})$.
    Receive $R_t$.
    If $RML(s_t^{w_t}, R_t) \neq 0$ // Note: $RML(s_t^{w_t}, R_t) = RML(\mathrm{argsort}(s_t^{w_t}), R_t)$
        $w_{t+1} = w_t - \eta z_t$ // $z_t$ is defined in Eq. (29)
    else
        $w_{t+1} = w_t$
End For
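
The update loop above can be sketched in Python as follows. This is a hedged illustration rather than the authors' implementation: the names `is_mistake`, `subgradient` and `perceptron_rank` are hypothetical, a mistake round is detected by checking whether the predicted order ever places a less relevant document above a more relevant one, and the toy stream is an assumption constructed so that the first unit vector separates it with margin `gamma`.

```python
import numpy as np

def is_mistake(order, R):
    """True if a less relevant document is ranked above a more relevant one (RML != 0)."""
    return bool(np.any(np.diff(R[order]) > 0))

def subgradient(X, s, R):
    """z of Eq. 29 for one pair (i*, j*) achieving the max in Eq. 27."""
    hinge = 1.0 + s[None, :] - s[:, None]                     # hinge[i, j] = 1 + s_j - s_i
    hinge = np.where(R[:, None] > R[None, :], hinge, -np.inf)  # keep pairs with R_i > R_j
    i_star, j_star = np.unravel_index(np.argmax(hinge), hinge.shape)
    return X[j_star] - X[i_star]                              # X^T (e_{j*} - e_{i*})

def perceptron_rank(stream, d, eta=1.0):
    """Run the perceptron-style update over a stream of (X_t, R_t) pairs."""
    w = np.zeros(d)
    mistakes = 0
    for X, R in stream:
        s = X @ w
        order = np.argsort(-s, kind="stable")   # predicted ranking
        if is_mistake(order, R):
            w = w - eta * subgradient(X, s, R)
            mistakes += 1
    return w, mistakes

# Toy separable stream (an assumption for illustration): the first coordinate
# carries the relevance signal, so w* = e_1 separates with margin gamma.
rng = np.random.default_rng(0)
d, m, gamma = 5, 6, 1.0
stream = []
for _ in range(200):
    R = rng.integers(0, 2, size=m)
    X = rng.normal(size=(m, d))
    X[:, 0] = (2 * R - 1) * gamma / 2
    stream.append((X, R))

w, mistakes = perceptron_rank(stream, d)
R_X = max(np.linalg.norm(X, axis=1).max() for X, _ in stream)
print(mistakes, 4 * R_X**2 / gamma**2)   # number of mistake rounds vs. the constant 4 R_X^2 / gamma^2
```

Since the loss on each round is at most 1, the number of mistake rounds upper bounds the cumulative NDCG/AP induced loss, and on separable data the standard perceptron argument bounds that count by $4R_X^2/\gamma^2$. Running the sketch with a different `eta` should leave the sequence of predictions unchanged, since only the scale of `w` changes.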


We have the following loss bound for Algorithm 3.

Theorem 10. Suppose Algorithm 3 receives a sequence of instances $(X_1, R_1), \ldots, (X_T, R_T)$. Let $R_X$ be the bound on the maximum $\ell_2$ norm of the feature vectors representing the documents and $f_t(w)$ be as defined in Eq. (28). Then the following inequality holds, after optimizing over learning rate $\eta$, $\forall\, w \in \mathbb{R}^d$:

$$\sum_{t=1}^T RML(s_t^{w_t}, R_t) \le \sum_{t=1}^T f_t(w) + 2\|w\|_2 R_X \sqrt{\sum_{t=1}^T f_t(w)} + 4\|w\|_2^2 R_X^2. \quad (30)$$

In the special case where there exists $w$ s.t. $f_t(w) = 0$, $\forall t$, we have

$$\sum_{t=1}^T RML(s_t^{w_t}, R_t) \le 4\|w\|_2^2 R_X^2. \quad (31)$$


Proof. We first bound the $\ell_2$ norm of the gradient. From Eq. 29 we have:

1st bound for $z_t$:

$$\|z_t\|_2 \le \|X_t^\top\|_{1 \to 2}\, \left\| (-e_{i^*} + e_{j^*})\, \mathbb{1}[R(i^*) > R(j^*)]\, \mathbb{1}[1 + s_{j^*} - s_{i^*} > 0] \right\|_1 \le 2R_X.$$

2nd bound for $z_t$:

On a mistake round, there exists at least 1 pair of documents whose scores and relevance levels are discordant. Hence, $\phi_C(s_t^{w_t}, R_t) \ge 1$. Hence, $\|z_t\|_2 \le 2R_X \le 2R_X\, \phi_C(s_t^{w_t}, R_t)$.

Thus, $\|z_t\|_2^2 \le 4R_X^2\, \phi_C(s_t^{w_t}, R_t)$. Since $\|z_t\|_2 = 0$ on a non-mistake round, we finally have:

$$\|z_t\|_2^2 \le 4R_X^2 f_t(w_t), \; \forall t.$$

The proof then follows as before: by plugging the expression for $\|z_t\|_2^2$ into the OGD regret bound (Prop. OGD Regret), optimizing over $\eta$, using the algebraic trick $x - b\sqrt{x} - c \le 0 \implies x \le b^2 + c + b\sqrt{c}$, and then using the inequality $f_t(w_t) \ge RML(s_t^{w_t}, R_t)$. □


As before, we can immediately derive a margin based bound.


Corollary 11. If the margin condition (21) holds, then the total loss, for both NDCG and AP induced loss, is upper bounded by $\frac{4R_X^2}{\gamma^2}$, a bound independent of the number of instances in the online sequence.

Proof. The proof is similar to that of Corollary 7. □


Importance of learning rate parameter $\eta$: Algorithm 3 also has the learning rate parameter $\eta$ embedded, and the optimal upper bound on loss is obtained by optimizing over $\eta$. However, like the classification perceptron, and unlike Algorithm 2, the performance is independent of $\eta$. To see this, we once again use the prediction $p_t = \mathrm{argsort}(\sum_{i \in \mathcal{M}_t} -\eta X_t \cdot z_i)$. The prediction is independent of $\eta$ if $z_i$ is independent of $w_i$. Once again, we consider, during a mistake round, two score vectors $s^1$ and $s^2$, where $s^1 = \eta s^2$. If the subgradient $z$, during a mistake round, is indeed independent of $w$ (and hence of the score $X \cdot w$), then $z$ is the same for both $s^1$ and $s^2$. During a mistake round, there is at least one discordant pair of documents, and the pair achieving the max in Eq. 27 is unchanged by a positive rescaling of the scores, so $z$ is indeed the same for both score vectors.



Comparison of Algorithm 2 and Algorithm 3: Both of our proposed perceptron-like algorithms can be thought of as analogues of the classic perceptron in the learning to rank setting. Algorithm 3 achieves the minimax optimal bound on separable datasets, unlike Algorithm 2, whose bound scales with the number of documents per query. However, Algorithm 3 operates on a surrogate (Eq. 27) which is not listwise in nature, even though it forms an upper bound on the listwise ranking measures. To emphasize, the surrogate does not differentially weigh errors at different points of the ranked list, which is an important property of popular surrogates in learning to rank. As our empirical results show (Section 8), on commercial datasets which are not separable, Algorithm 3 has significantly worse performance than Algorithm 2.


7. Related Work in Perceptron for Ranking


There exist a number of papers in the literature dealing with perceptron in the context of ranking. We will compare and contrast our work with existing work, paying special attention to the papers whose settings come closest to ours.

First, we would like to point out that, to the best of our knowledge, there is no work that establishes a bound for NDCG, cut off at the top $k$ positions, that is independent of the number of documents (Theorem 8). Moreover, we believe our work, for the first time, formally establishes a minimax bound, achievable by any deterministic online algorithm, in the learning to rank setting, under the assumption of separability.

Crammer and Singer [2001] were among the first to introduce perceptron in ranking. The setting as well as the results of their perceptron are quite different from ours. Their paper assumes there is a fixed set of ranks $\{1, 2, \ldots, k\}$. An instance is a vector of the form $x \in \mathbb{R}^d$ and the supervision is one of the $k$ ranks. The perceptron has to learn the correct ranking of $x$, with the loss being 1 if the correct rank is not predicted. The paper does not deal with query-document lists and does not consider learning to rank measures like NDCG/AP.

The results of Wang et al. [2015] have some similarity to ours. Their paper introduces algorithms for online learning to rank, but does not claim to have any "perceptron type" results. However, their main theorem (Theorem 2) has a perceptron bound flavor to it, where the cumulative NDCG/AP losses are upper bounded by the cumulative surrogate loss and a constant. The major differences with our results are these: Wang et al. [2015] consider a different instance/supervision setting and consequently have a different surrogate loss. It is assumed that for each query $q$, only a pair of documents $(x_i, x_j)$ is received at each online round, with the supervision being $\{+1, -1\}$, depending on whether $x_i$ is more/less relevant than $x_j$. The surrogate loss is defined at the level of a pair of documents, and not at the query-document matrix level. Moreover, there is no result equivalent to our Theorem 8, nor is any kind of minimax bound established.

The recent work of Jain et al. [2015] also contains results similar to ours. On the one hand, their predtron algorithm is more general. But on the other hand, the bound achieved by predtron, applied to the ranking case, has a scaling factor that is polynomial in $m$ with degree greater than one, significantly worse than our linear scaling. Moreover, it does not have the $NDCG_k$ bounds scaling as a function of $k$ proved anywhere.

There are other, less related papers, all of which deal with perceptron in ranking in some form or the other. Ni and Huang [2008] introduce the concept of margin in a particular setting, with corresponding perceptron bounds. However, their paper does not deal with query-document matrices, nor NDCG/AP induced losses. The works of Elsas et al. [2008] and Harrington [2003] introduce online perceptron based ranking algorithms, but do not establish theoretical results. Shen and Joshi [2005] give a perceptron type algorithm with a theoretical guarantee, but in their paper, the supervision is in the form of full rankings (instead of relevance vectors). A few recent papers deal with the generalization ability of online learning algorithms with pair-wise surrogates [Wang et al., 2012, Kar et al., 2013], online AUC optimization [Gao et al., 2013] and optimization at top ranked positions [Li et al., 2014]. However, none of these papers are related to perceptron for learning to rank.


8. Experiments 

We conducted experiments on a simulated dataset and three large scale industrial benchmark datasets. Our results demonstrate the following:

• We simulated a margin $\gamma$ separable dataset. On that dataset, the two algorithms (Algorithm 2 and Algorithm 3) rank all but a finite number of instances correctly, which agrees with our theoretical prediction.

• On three commercial datasets, which are not separable, Algorithm 2 shows competitive performance with a strong baseline algorithm, indicating its practical usefulness. Algorithm 3 performs quite poorly on two of the datasets, indicating that despite minimax optimality under margin separability, it has limited practical usefulness.


Baseline Algorithm: We compared our algorithms with the online version of the popular ListNet ranking algorithm [Cao et al., 2007]. ListNet is not only one of the most cited ranking algorithms (over 800 citations according to Google Scholar), but also one of the most validated algorithms [Tax et al., 2015]. We conducted online gradient descent on the cross-entropy convex surrogate of ListNet to learn a ranking function in an online manner. While there exist ranking algorithms which have demonstrated better empirical performance than ListNet, they are generally based on non-convex surrogates with non-linear ranking functions. These algorithms cannot be converted in a straightforward way (or at all) into online algorithms which learn from streaming data. We also did not compare our algorithms with other perceptron algorithms since they do not usually have a setting similar to ours and would require modifications. We emphasize that our objective is not simply to add one more ranking algorithm to the huge variety that already exists. Our experiments on real data are to show that Algorithm 2 has competitive performance and has a major advantage over Algorithm 3, due to the difference in the nature of the surrogates being used for the two algorithms.

Experimental Setting: For all datasets, we report average $NDCG_{10}$ and average AP over a time horizon. Average $NDCG_{10}$ at iteration $t$ is the cumulative $NDCG_{10}$ up to iteration $t$, divided by $t$ (same for average AP). We remind the reader that at each iteration $t$, a document matrix is ranked by the algorithm, with the performance (according to $NDCG_{10}$ or AP) measured against the true relevance vector corresponding to the document matrix. For all the algorithms, the corresponding best learning rate $\eta$ was fixed after conducting experiments with multiple different rates and observing the best time averaged $NDCG_{10}$/AP over a fixed time interval.
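
The reported quantity can be computed as in the following sketch. It assumes the standard gain $G(r) = 2^r - 1$ and discount $D(i) = 1/\log_2(1 + i)$; the function names `ndcg_at_k` and `time_average`, and the convention of returning 1 for a query without any relevant document, are choices made for this illustration and are not taken from the paper's experimental code.

```python
import numpy as np

def ndcg_at_k(scores, rel, k):
    """NDCG truncated at position k for one query."""
    order = np.argsort(-scores, kind="stable")        # ranking induced by the scores
    discounts = 1.0 / np.log2(np.arange(2, k + 2))    # D(i) = 1 / log2(1 + i), i = 1..k
    gains = 2.0 ** rel[order][:k] - 1.0               # G(r) = 2^r - 1 at the top-k positions
    ideal = 2.0 ** np.sort(rel)[::-1][:k] - 1.0       # gains of the ideal ordering
    z_k = np.sum(ideal * discounts[: len(ideal)])
    if z_k == 0:
        return 1.0  # assumption: a query with no relevant document counts as perfectly ranked
    return float(np.sum(gains * discounts[: len(gains)]) / z_k)

def time_average(values):
    """Cumulative sum up to iteration t divided by t, for every t."""
    values = np.asarray(values, dtype=float)
    return np.cumsum(values) / np.arange(1, len(values) + 1)

rel = np.array([2, 1, 0])
perfect = ndcg_at_k(np.array([3.0, 2.0, 1.0]), rel, k=2)
reversed_list = ndcg_at_k(np.array([1.0, 2.0, 3.0]), rel, k=2)
print(perfect, reversed_list)
print(time_average([1.0, 0.0, 1.0]))
```

Plotting `time_average` of the per-round values against the iteration count gives curves of the kind described in this section.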

Simulated Dataset: We simulated a margin separable dataset (Eq. (21)). Each query had $m = 20$ documents, each document represented by a 20 dimensional feature vector, and five different relevance levels $\{4, 3, 2, 1, 0\}$, with relevances distributed uniformly over the documents. The feature vectors of equivalent documents (i.e., documents with the same relevance level) were generated from a Gaussian distribution, with documents of different relevance levels generated from different Gaussian distributions. A 20 dimensional unit norm ranker was generated from a Gaussian distribution, which induced separability with margin. Fig. 1 compares the performance of Algorithm 2, Algorithm 3 and online ListNet. The $NDCG_{10}$ values of the perceptron type algorithms rapidly converge to 1, validating their finite cumulative loss property. To reiterate, since for separable datasets the cumulative NDCG induced loss is bounded by a constant, the time averaged NDCG should rapidly converge to 1. The OGD algorithm for ListNet has only a regret guarantee of $O(\sqrt{t})$; hence the time averaged regret converges at rate $O(1/\sqrt{t})$, i.e., its convergence is significantly slower than that of the perceptron-like algorithms.

Figure 1: Time averaged $NDCG_{10}$ for Algorithm 2, Algorithm 3 and ListNet, for the separable dataset. The two perceptron-like algorithms have an imperceptible difference.

Commercial Datasets: We chose three large scale ranking datasets released by the industry to analyze the performance of our algorithms. MSLR-WEB10K [Liu et al., 2007] is the dataset published by Microsoft's Bing team, consisting of 10,000 unique queries, with feature dimension of size 136 and 5 distinct relevance levels. The Yahoo Learning to Rank Challenge dataset [Chapelle and Chang, 2011] consists of 19,944 unique queries, with feature dimension of size 700 and 5 distinct relevance levels. Yandex, Russia's biggest search engine, published a dataset (link to the dataset given in the work of Chapelle and Chang [2011]) consisting of 9124 queries, with feature dimension of size 245 and 5 distinct relevance levels. Since AP is suited to binary relevance vectors, we converted the multi-graded vectors to binary vectors when comparing algorithms based on AP. All documents with a non-zero relevance grade were considered relevant for the purpose of conversion.
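
The conversion rule can be made concrete with a short sketch of AP computed on binarized grades. The function name `average_precision` and the convention of returning 1 when a query has no relevant document are assumptions of this illustration.

```python
import numpy as np

def average_precision(scores, rel):
    """AP of the ranking induced by scores, after binarizing multi-graded relevance."""
    binary = (np.asarray(rel) > 0).astype(int)   # any non-zero grade counts as relevant
    order = np.argsort(-np.asarray(scores, dtype=float), kind="stable")
    hits = binary[order]                         # relevance along the ranked list
    if hits.sum() == 0:
        return 1.0  # assumption: a query with no relevant document counts as perfectly ranked
    precision_at = np.cumsum(hits) / np.arange(1, len(hits) + 1)
    return float(np.sum(precision_at * hits) / hits.sum())

# One relevant document (grade 2) ranked second: AP = 1/2.
ap_second = average_precision([0.9, 0.8, 0.7], [0, 2, 0])
# Both graded documents ranked above the irrelevant one: AP = 1.
ap_perfect = average_precision([0.9, 0.8, 0.7], [3, 1, 0])
# All irrelevant documents ranked above the r = 2 relevant ones out of m = 4.
ap_worst = average_precision([1.0, 2.0, 3.0, 4.0], [1, 1, 0, 0])
print(ap_second, ap_perfect, ap_worst)
```

The last case matches the worst-case expression $\frac{1}{r}\left(\frac{1}{m-r+1} + \ldots + \frac{r}{m}\right)$ used in the appendix, here $\frac{1}{2}\left(\frac{1}{3} + \frac{2}{4}\right) = \frac{5}{12}$.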

Algorithm 2 performs better than ListNet on the MSLR-WEB dataset (average NDCG@10 over the last ten iterations = 0.25 vs 0.22, average AP over the last ten iterations = 0.49 vs 0.37), performs slightly worse on the Yahoo dataset (average NDCG@10 over the last ten iterations = 0.75 vs 0.74, average AP over the last ten iterations = 0.875 vs 0.87) and has overlapping performance on the Yandex dataset. See Fig. 2 and Fig. 3 for AP and NDCG results respectively. The experiments validate that our proposed perceptron type algorithm (Algorithm 2) has competitive performance compared to online ListNet on real ranking datasets, even though it does not achieve the theoretical lower bound. Algorithm 3 performs quite poorly on both the Yandex and Yahoo datasets. One possible reason for the poor performance is that the underlying surrogate (Eq. 27) is not listwise in nature. It does not put more emphasis on errors at the top and hence is not very suitable for a listwise ranking measure like NDCG, even though it achieves the theoretical lower bound on separable datasets.


Figure 2: Time averaged AP for Algorithm 2, Algorithm 3 and ListNet for 3 commercial datasets.

Figure 3: Time averaged $NDCG_{10}$ for Algorithm 2, Algorithm 3 and ListNet for 3 commercial datasets.


9. Conclusion

We proposed two perceptron-like algorithms for learning to rank, as analogues of the perceptron for classification algorithm. We showed how, under the assumption of separability (i.e., existence of a perfect ranker), the cumulative NDCG/AP induced loss is bounded by a constant. The first algorithm operates on a listwise, large margin family of surrogates, which are adaptable to NDCG and AP. The second algorithm is based on another large margin surrogate, which does not have the listwise property. We also proved a lower bound on cumulative NDCG/AP loss under a separability condition and showed that it is the minimax bound, since our second algorithm achieves the bound. We conducted experiments on simulated and commercial datasets to corroborate our theoretical results.

An important aspect of perceptron type algorithms is that the ranking function is updated only on a mistake round. Since non-linear ranking functions generally have better performance than linear ranking functions, an online algorithm learning a flexible non-linear kernel ranking function would be very useful in practice. We highlight how the perceptron's "update only on mistake round" aspect can prove to be powerful when learning a non-linear kernel ranking function. Since the score of each document is obtained via the inner product of the ranking parameter $w$ and the feature representation of document $x$, this can be easily kernelized to learn a range of non-linear ranking functions. However, the inherent difficulty of applying OGD to a convex ranking surrogate with a kernel function is that at each update step, the document list ($X$ matrix) will need to be stored in memory. For a moderately large dataset, this soon becomes a practical impossibility. One way of bypassing the problem is to approximately represent the kernel function via an explicit feature projection [Rahimi and Recht, 2007, Le et al., 2013]. However, even for moderate length features (like 136 for MSLR-WEB10K), the projection dimension becomes too high for efficient computation. Another technique is to have a finite budget for storing document matrices and discard carefully chosen members from the budget when the budget capacity is exceeded. This budget extension has been studied for perceptron in classification [Dekel et al., 2008, Cavallanti et al., 2007]. The fact that perceptron updates are only on mistake rounds leads to strong theoretical bounds on target loss. For OGD on general convex surrogates, the fact that the function update happens on every round leads to inherent difficulties when using their kernelized versions [Zhao et al., 2012] (the theoretical guarantees on the target loss are not as strong as in the kernelized perceptron on a budget case). The results presented in this paper open up a fruitful direction for further research: namely, to extend the perceptron algorithm to non-linear ranking functions by using kernels and establishing theoretical performance bounds in the presence of a memory budget.

Appendix A.

Proof of TheoremjH 

Proof for AP: As stated previously, documents pertaining to every query are sorted according to 
relevance labels. We point out another critical property of AP (for that matter any ranking measure). 
AP is only affected when scores of 2 documents, which have different relevance levels, are not 
consistent with the relevance levels, as long as ranking is obtained by sorting scores in descending 
order. That is, if Ri = Rj, then it does not matter whether Si > sj or Sj < Sj. So, without loss 
of generality, we can always assume that within same relevance class, the documents are sorted 
according to scores. That is, if Ri = Rj with i < j, then Si > Sj. The without loss of generality 
holds because SLAM is calculated with knowledge of relevance and score vector. Thus, within 
same relevance class, we can sort the documents according to their scores (effectively exchanging 
document identities), without affecting SLAM loss. 

Let R G be an arbitrary binary relevance vector, with r relevant documents and m — r 
irrelevant documents in a list. AP loss is only incurred if at least 1 irrelevant document is placed 
above at least 1 relevant document. With reference to in Eq. Q, for any i > r + 1 and 

V j > i, we have l(i?i > Rj) = 0, since Ri = Rj = 0. For any i > r + 1 and V j < i, 
l{Ri > Rj) = 0 since documents are sorted according to relevance labels and Ri = 0, Rj = 1. 
Thus, w.l.o.g., we can take Vr+i, = 0, since indicator in SLAM loss will never turn on for 

i > r + 1. 

Let a score vector $s$ be such that an irrelevant document $j$ has the highest score among the $m$ 
documents. Then $\phi^{v}_{SLAM} = v_1(1 + s_j - s_1) + v_2(1 + s_j - s_2) + \dots + v_r(1 + s_j - s_r)$. 
The maximum possible AP induced loss, when at least one irrelevant document has a higher score than 
all relevant documents, occurs when all irrelevant documents outscore all relevant documents. The AP 
loss in that case is $1 - \frac{1}{r}\left(\frac{1}{m-r+1} + \frac{2}{m-r+2} + \dots + \frac{r}{m-r+r}\right)$. 
Since $\phi^{v}_{SLAM}$ has to upper bound the AP loss for all $s$ (for each $R$), and since $s_j$ can 
be infinitesimally greater than all other score components (thus $1 + s_j - s_i \approx 1$ for all 
$i = 1, \dots, r$), we need the following inequality for the upper bound property to hold: 

$$v_1 + v_2 + \dots + v_r \geq 1 - \frac{1}{r}\left(\frac{1}{m-r+1} + \frac{2}{m-r+2} + \dots + \frac{r}{m-r+r}\right).$$

Similarly, let a score vector $s$ be such that an irrelevant document $j$ has a higher score than all 
but the first relevant document. Then 
$\phi^{v}_{SLAM} = v_2(1 + s_j - s_2) + v_3(1 + s_j - s_3) + \dots + v_r(1 + s_j - s_r)$. 
The maximum possible AP induced loss in this case occurs when all irrelevant documents are placed 
above all relevant documents except the first relevant document. The AP loss in that case is 
$1 - \frac{1}{r}\left(1 + \frac{2}{m-r+2} + \frac{3}{m-r+3} + \dots + \frac{r}{m-r+r}\right)$. 
Following the same line of logic for upper bounding as before, we get 

$$v_2 + v_3 + \dots + v_r \geq 1 - \frac{1}{r}\left(1 + \frac{2}{m-r+2} + \frac{3}{m-r+3} + \dots + \frac{r}{m-r+r}\right).$$

Likewise, if we keep repeating the logic, we get a sequence of inequalities, with the last inequality 
being 

$$v_r \geq 1 - \frac{1}{r}\left(r - 1 + \frac{r}{m}\right).$$

Now, it can be easily seen that our definition of $v^{AP}$ satisfies the inequalities. 
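The closed-form worst-case AP losses used in the inequalities above can be checked numerically. The following is a minimal sketch, not part of the original proof: the function names `average_precision`, `worst_case_ap_loss` and `worst_case_ranking` are hypothetical, and the index $k$ denotes the first relevant document that is outscored by all irrelevant documents ($k = 1$ gives the first inequality, $k = r$ the last).

```python
def average_precision(ranked_relevance):
    """AP of a ranked list of binary labels (1 = relevant), top position first."""
    hits, total = 0, 0.0
    for position, label in enumerate(ranked_relevance, start=1):
        if label == 1:
            hits += 1
            total += hits / position  # precision at this relevant document
    return total / hits if hits else 0.0


def worst_case_ap_loss(m, r, k):
    """Closed-form loss when relevant documents 1..k-1 stay on top and all
    m - r irrelevant documents are placed above relevant documents k..r."""
    tail = sum(i / (m - r + i) for i in range(k, r + 1))
    return 1.0 - ((k - 1) + tail) / r


def worst_case_ranking(m, r, k):
    """Ranked label list realising the configuration described above."""
    return [1] * (k - 1) + [0] * (m - r) + [1] * (r - k + 1)


# Compare the closed form with a direct AP computation (illustrative sizes).
m, r = 7, 3
for k in range(1, r + 1):
    direct = 1.0 - average_precision(worst_case_ranking(m, r, k))
    assert abs(direct - worst_case_ap_loss(m, r, k)) < 1e-12
```

For $k = r$ the closed form reduces to $1 - \frac{1}{r}(r - 1 + \frac{r}{m})$, the right-hand side of the last inequality.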

Proof for NDCG: 

We once again remind the reader that $\pi^{-1}(i)$ denotes the position of document $i$ in 
permutation $\pi$. Thus, if document $i$ is placed at position $j$ in $\pi$, then $\pi^{-1}(i) = j$. 
Moreover, as for AP, we assume that $R_1 \geq R_2 \geq \dots \geq R_m$ and that within the same 
relevance class, documents are sorted according to score. We use a modified definition of NDCG, for 
$k = m$, which is required for the proof: 

$$NDCG(s, R) = \frac{1}{Z(R)} \sum_{i=1}^{m} G(R_i)\, D(\pi^{-1}(i)),$$

where $G(r) = 2^r - 1$, $D(i) = \frac{1}{\log_2(1+i)}$, 
$Z(R) = \max_{\pi} \sum_{i=1}^{m} G(R_i)\, D(\pi^{-1}(i))$, and $\pi$ in the NDCG formula is the 
permutation obtained by sorting the score vector $s$ in descending order. 

We begin the proof: 

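$$
\begin{aligned}
1 - NDCG(s, R) &= \frac{1}{Z(R)} \sum_{i=1}^{m} G(R_i)\, D(i) - \frac{1}{Z(R)} \sum_{i=1}^{m} G(R_i)\, D(\pi^{-1}(i)) \\
&= \frac{1}{Z(R)} \sum_{i=1}^{m} G(R_i) \left( D(i) - D(\pi^{-1}(i)) \right).
\end{aligned}
$$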

Now, $D(i) = \frac{1}{\log_2(1+i)}$ is a decreasing function of $i$, so $D(i) - D(\pi^{-1}(i))$ is 
positive only if $i < \pi^{-1}(i)$. This means that document $i$ in the original list is placed at 
position $\pi^{-1}(i)$, which comes after $i$, in the order obtained by sorting the score vector $s$. 
By the assumption that indices of documents within the same relevance class are sorted according to 
their scores, this means that document $i$ is outscored by another document (say with index $k$) with 
a lower relevance level. At that point, the function 
$\max\left(0, \max_{j = 1, \dots, m} \{\mathbb{1}(R_i > R_j)(1 + s_j - s_i)\}\right)$ turns on with 
value at least 1 (i.e., $1 + s_k - s_i \geq 1$) and with weight 

$$v_i^{NDCG} = \frac{G(R_i)\, D(i)}{Z(R)}.$$

Since $D(i) - D(\pi^{-1}(i)) \leq D(i)$, we can now easily see the upper bound property. 
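The upper bound property for NDCG can also be checked empirically on random inputs. The sketch below is an illustration under stated assumptions rather than the authors' code: it assumes the discount $D(i) = 1/\log_2(1+i)$, the weight $v_i^{NDCG} = G(R_i)D(i)/Z(R)$, and the SLAM form $\sum_i v_i \max(0, \max_j \{\mathbb{1}(R_i > R_j)(1 + s_j - s_i)\})$; the helper names (`canonical_order`, `ndcg_loss`, `slam_ndcg`) are hypothetical. Scores are drawn from a continuous distribution, so ties are not handled specially.

```python
import math
import random


def gain(rel):
    return 2 ** rel - 1


def discount(position):
    return 1.0 / math.log2(position + 1)


def canonical_order(scores, rels):
    """Reorder documents so relevance is non-increasing and, within a
    relevance class, scores are decreasing (the w.l.o.g. assumption)."""
    order = sorted(range(len(rels)), key=lambda i: (-rels[i], -scores[i]))
    return [scores[i] for i in order], [rels[i] for i in order]


def ndcg_loss(scores, rels):
    """1 - NDCG for k = m; inputs must already be in canonical order."""
    m = len(rels)
    z = sum(gain(rels[i]) * discount(i + 1) for i in range(m))
    ranking = sorted(range(m), key=lambda i: -scores[i])  # best score first
    position = {doc: pos for pos, doc in enumerate(ranking, start=1)}
    dcg = sum(gain(rels[i]) * discount(position[i]) for i in range(m))
    return 1.0 - dcg / z


def slam_ndcg(scores, rels):
    """SLAM surrogate with the assumed NDCG weights; canonical order required."""
    m = len(rels)
    z = sum(gain(rels[i]) * discount(i + 1) for i in range(m))
    total = 0.0
    for i in range(m):
        margins = [1 + scores[j] - scores[i] for j in range(m) if rels[i] > rels[j]]
        hinge = max([0.0] + margins)
        total += gain(rels[i]) * discount(i + 1) / z * hinge
    return total


rng = random.Random(0)
for _ in range(1000):
    rels = [rng.randint(0, 2) for _ in range(6)]
    rels[0] = 2  # guarantee Z(R) > 0
    scores = [rng.uniform(-2.0, 2.0) for _ in range(6)]
    s, R = canonical_order(scores, rels)
    assert slam_ndcg(s, R) >= ndcg_loss(s, R) - 1e-12
```

The reordering in `canonical_order` matters because the weights depend on the document index $i$ through $D(i)$; the bound is stated for documents indexed in that order.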